PREFACE


From several years’ experience in teaching classes in statistics and 
giving advice at various times to experimentalists, I have come to the 
conclusion that there is a distinct need for more than one type of text- 
book. On the one hand there are many who are interested only in 
knowing something of the theory and principles. In this class we find 
students who are endeavoring to obtain a broad knowledge of all sub- 
jects related to science and art, practicing technicians such as doctors 
of medicine and technical advisers in agriculture, and administrators of 
research activities. It would be idle to set students of this type to 
work on laborious practical examples. It would probably discourage 
them at the start, and by absorbing time would reduce the possibility of 
teaching them some of the very attractive philosophical phases of the 
subject. In a maze of calculations the principles might be lost sight of 
completely, and the student would emerge with a technique for mechanical
operations and no ability to solve actual problems. At the beginning
it is not training in actual methods that is required, but the building
up of a sound knowledge of fundamental principles.

On the other hand, we have an increasing number of students who, 
having had some elementary training in statistics and some experience 
in research work, come to the point finally of requiring a practical 
knowledge of methods of analysis and some facility in the devices of 
calculation. It cannot be denied that two or three years spent
in studying the principles and theory of statistics will not, by themselves,
fit the student to solve practical problems. To suppose otherwise is to
ignore the many complications that are involved, and the fact that training
in facility is necessary in order that statistical computations may be
attacked with determination and completed in a reasonable length of time.
One of the objections very
often raised to the use of statistical methods is the time necessary to do 
the routine work. Frequently this sort of thing can be attributed to 
insufficient training in the actual methods that should be employed and 
a lack of organization of the work. 

The basis of this book, therefore, is the supplying of a textbook in
statistics for students who have passed the elementary stage; who have 
studied a fair amount of theory and principles and now wish to equip 
themselves for actual statistical work in their own field of research 
activities. The experiment station agronomist, the cereal chemist, the 
plant breeder, and the economic entomologist are all examples of research 
workers who require a practical knowledge of statistical methods, and 
undoubtedly there are many others in the same class. It has been my
experience that to acquire this knowledge the student must work
through a comprehensive series of actual examples, and these should 
not be miniature examples as they are likely to give him a wrong impres- 
sion of what will actually be required of him at a later date. Most of 
the various examples and exercises in this book are therefore of actual 
size, but every effort has been made to keep them within such limits as
will enable the student to work through a representative set in one 
academic year. 

This is not to say that a course in statistical methods should ever be 
given without emphasis on principles, and this applies particularly to 
the principles of experimental design. When studying practical methods,
the opportunity is prime for the student to acquire a solid grounding
in this important phase of the subject. The discussions in the
greater part of the book, therefore, are worked out so that they have a 
direct bearing on the principles of the design of experiments. The first 
half, for example, while containing material that involves a repetition 
of elementary work that has already been covered, is nevertheless written 
so that, in reviewing, the student is brought into contact immediately 
with the structure of actual experiments. Also in this portion of the 
book are certain routine calculations which are designed mainly to give
the student some facility in calculation before he comes to the heavier 
problems in the latter part. 

There are many to whom I owe thanks in the preparation of this 
book, but in the first place I must acknowledge a very great debt to 
Professor R. A. Fisher, who has been mainly responsible for the development
of the methods that are set forth. Furthermore, he has been very
generous of his own time in explaining how new problems may be solved 
and in clearing up doubts as to the exact application of previously established
methods. I wish also to thank the staff of the Statistical Laboratory
at Ames, Iowa, for advice and suggestions, especially Dr. G. W.
Snedecor, who in addition has given me permission to use, wholly or in
part, any of the tables or material in his excellent new textbook, "Statistical
Methods." Thanks are due to many who have called attention
to errors in the preprint edition, and to ways in which the explanations 
and examples could be improved. This applies particularly to my students,
who have taken a special interest in suggesting improvements of
this kind. They have also taken a particular interest in checking the 
calculations in order that the book should be as nearly perfect as possi- 
ble in this respect. In typing the manuscript I must acknowledge the 
untiring assistance of Misses E. J. Stewart and M. G. White. 

C. H. GOULDEN.


February, 1939.


METHODS OF STATISTICAL ANALYSIS


CHAPTER I 
INTRODUCTION

1. The Logic of Statistical Methods. Applying statistical methods
to experimental work involves the use of certain logical ideas appropriate
to experimental procedure. The problems of statistics are, therefore, not 
entirely mathematical problems; in fact they are very largely problems 
based on the technique and requirements of the research worker. This 
important point has not always been clearly understood and hence we
find, in the history of the development of statistical methods, various 
attempts to solve the problems of the experimentalist by the application 
of purely mathematical methods of reasoning and derivation. Thus we 
find prodigious attempts being made to apply the method of inverse probability
to the testing of the significance of results obtained in experiments.
This theory has to do with the evaluation of the probability of
the occurrence of certain specified events on the basis of what has hap- 
pened in some previous event. For example, if 8 balls are drawn from 
an urn containing black and white balls, and are found to consist of 3 
white and 5 black balls; to derive from this result an exact statement of 
the probability of obtaining a white ball in drawing another single ball 
is a problem in inverse probability. Everyone will agree that, on the 
basis of the ratio of white to black balls in the sample drawn, in drawing 
another ball one's expectation tends towards black, but very few will 
agree that this expectation can be put in the form of an exact statement 
of mathematical probability. On first thought, one might be inclined 
to think that this type of problem is the same as the statistical one of 
taking samples and reasoning from these samples to the populations 
from which they were drawn. We shall see, however, that there is a 
very essential difference between the two situations; that to regard 
these two situations as the same is merely to misunderstand the true 
nature of the methods of obtaining new information by experimental 
methods. To illustrate these points in further detail we shall follow 
through the procedure of operating a very simple experiment, in which 
the statistical method will arise as a natural consequence of the efforts 
of the investigator to get the most out of his experiment.

2. A Simple Experiment in Identifying Varieties of Wheat. This
hypothetical experiment is modelled after the famous tea-tasting
experiment described by R. A. Fisher (1), but in some respects the procedure
is simplified. Fisher's hypothetical experiment will undoubtedly
remain as a classic in statistical literature, and after following through 
the experiment described here the student will do well to make a similar 
study of the tea-tasting experiment as it discusses certain aspects of this 
type of problem that cannot be presented here. 

A wheat expert claims that, if he is presented with grain samples of 
two particular varieties which we shall designate as A and B, he can 
distinguish between them. He does not claim the ability to identify 
either one of the varieties, if they are presented to him separately, and 
further there is no special mention of an ability to differentiate between 
these samples at all times and under all conditions with perfect accuracy. 
The claim is for a certain power of differentiation, and we must proceed 
in the planning of the experiment accordingly; that is, we must plan
the experiment in such a way that any reasonable power of differentiation
possessed by the operator will be demonstrated. With this
knowledge we can proceed to set up the experiment. 

It will be obvious with a little study that, in order to plan the experi- 
ment correctly, it will be necessary to anticipate the possible results. 
Suppose that we presented the operator with only one pair of samples 
and he classified them correctly. Without any knowledge whatever of 
wheat varieties he could, by pure guesswork, name the varieties cor- 
rectly in 50 per cent of the cases. This follows from the fact that there 
are only 2 ways of classifying them, and if the operator has no power of 
differentiating them, these 2 ways are equally likely. Thus in about
half of the cases he would place them correctly, and in the remainder of 
the cases incorrectly. Our conclusion must be that 1 pair of samples 
would not be sufficient to produce a clear-cut result, regardless of the
efficiency or the inefficiency of the operator. What will be the effect of 
increasing the number of pairs of samples? Obviously, the operator 
would be much more unlikely to place several pairs of samples correctly 
than he would just 1 pair. Can this statement be put in more definite 
terms? Let us assume that 6 pairs are being used and see if we can 
calculate the probability of a correct result, or, in other words, the 
proportion of the cases in which the operator, without any power of 
differentiation of the samples, could be expected to reach a correct 
placing. If there are 6 pairs of samples, each pair may be placed either 
rightly or wrongly, so that there are just 7 different kinds of results. 
These are: 6 right, 5 right, 4 right, 3 right, 2 right, 1 right, and none 
right. The pairs may be thought of as being presented to the operator
one at a time, so there are 2 ways of placing the first pair (right or
wrong), 2 ways of placing the second pair, and so forth for all the pairs.
Each result for a pair may occur with either result for another pair, so
that for 2 pairs we would have 2 X 2 possible combinations of placings.
These are: both right; first pair right and second pair wrong; second
pair right and first pair wrong; and both wrong. Continuing with this
reasoning, it turns out that for 3 pairs the possible number of combinations
of placings is 2 X 2 X 2; and, finally, for 6 pairs the total number
is 2^6 = 64. If now the operator places all 6 pairs of samples correctly,
we are in a position to place an evaluation on this result. There is only 
1 way of placing all pairs correctly, so that if the operator has no knowl- 
edge whatever of wheat varieties he would be expected to place them 
correctly in only 1 out of 64 trials. This would be a rather odd chance, 
and we would therefore be inclined, in the event of a successful placing, 
to attribute it to the ability of the operator in differentiating the 
varieties. Another way to regard this is to consider the consequences of 
adopting as a standard, in the examination of a large number of opera- 
tors, that all pairs must be placed correctly. Then in 1 out of 64 cases 
we could be expected to attribute to the operator a power of differentiat- 
ing the varieties that he did not actually possess. This would seem to be 
a fairly safe standard. In fact it would undoubtedly be argued from 
the standpoint of the operators being tested that the standard was 
much too high. In general practice, it is usual to adopt a ratio of 1/20 
as an arbitrary level for discriminating between real and chance effects. 
That is, an event is not regarded as significant unless it would only 
occur by chance variation in not more than 1 out of 20 trials. 
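
The counting argument above can be checked by listing every possible placing. The following is a minimal sketch in Python, not part of the original text; the names `N_PAIRS`, `placings`, and `LEVEL` are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

N_PAIRS = 6  # number of pairs presented, as in the text

# Each pair is placed either right (True) or wrong (False);
# list every combination of results for the 6 pairs.
placings = list(product([True, False], repeat=N_PAIRS))
total = len(placings)  # 2 ** 6

# Only one combination has every pair right.
all_right = sum(1 for p in placings if all(p))
chance_all_right = Fraction(all_right, total)

LEVEL = Fraction(1, 20)  # the conventional 1/20 level mentioned in the text

print(total)                      # 64
print(chance_all_right)           # 1/64
print(chance_all_right <= LEVEL)  # True
```

Exact fractions are used rather than decimals so that the ratios appear in the same form as in the discussion.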

We now have to consider the interpretation that would be made if 
the operator were to obtain such a result as 5 pairs right and 1 pair 
wrong. In the above case there was only 1 way of placing 6 of the 
pairs correctly, but the situation is different now in that any one of the 
6 pairs may be the one that is incorrectly placed, making a total of 6 
ways, out of the grand total of 64, in which the samples may be placed 
5 right and 1 wrong. Then, in considering the experiment from the 
standpoint of the possibility of its indicating a power of differentiation 
on the part of the operator, we must also take into consideration the 
number of ways of placing 6 pairs correctly. That is, we must enumerate
the number of ways in which the operator can place 5 pairs of samples
correctly, or any other result more favorable to his claim. This makes
a total of 1 + 6 = 7 out of 64 ways in which such a result or one more
favorable to the operator could occur, and if the operator has no power
of differentiation this result will be expected to occur in just that proportion
of the cases. In approximate figures the ratio 7/64 is equal to
1/9; and we note that this is larger than the ratio 1/20, which, as pointed
out above, is accepted as a general level of significance. To accept 
the ratio of 1/9 as indicating a power of differentiation would be to 
take the risk of being wrong in 1 out of 9 similar trials, and this would 
probably be too great a risk for most investigators to accept. It might, 
however, be taken as a sufficient indication to justify further experi- 
mentation. 

It will be found convenient, in experiments of this type, to set up in 
the form of a table all the possible results with the corresponding 
number of ways in which each can occur. Another column of the table 
may be used to show the ratio that we have taken above to indicate the 
significance of each result. The figures for this experiment are given in 
Table 1. Why do we not give more values in the third column? 

TABLE 1

Possible Results, Number of Combinations, and Ratio of Significance,
for a Simple Experiment in Differentiating Six Pairs of Samples

Possible Results        No. of Combinations    Ratio of Significance
6 right   0 wrong                1                     1/64
5 right   1 wrong                6                     7/64
4 right   2 wrong               15                    22/64
3 right   3 wrong               20
2 right   4 wrong               15
1 right   5 wrong                6
0 right   6 wrong                1
Total                           64
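
The entries of Table 1 can be reproduced from binomial coefficients. The sketch below is an illustration only, assuming Python 3.8 or later for `math.comb`; the helper `tail_ways` and the variable names are not from the original. It prints the ratio for every row, including those the table leaves blank.

```python
from fractions import Fraction
from math import comb

N_PAIRS = 6
TOTAL = 2 ** N_PAIRS  # 64 equally likely placings under pure guessing

def tail_ways(k_right, n=N_PAIRS):
    """Number of placings with k_right or more pairs right."""
    return sum(comb(n, k) for k in range(k_right, n + 1))

rows = []
for k in range(N_PAIRS, -1, -1):
    # (number right, number wrong, ways for exactly k right, ways for k or more right)
    rows.append((k, N_PAIRS - k, comb(N_PAIRS, k), tail_ways(k)))

for right, wrong, ways, tail in rows:
    print(f"{right} right  {wrong} wrong  {ways:>2}  {tail:>2}/{TOTAL}")

# Results whose ratio does not exceed the 1/20 level.
LEVEL = Fraction(1, 20)
significant = [right for right, _, _, tail in rows if Fraction(tail, TOTAL) <= LEVEL]
print(significant)   # [6]
```

With only 6 pairs, a perfect score is the sole result that reaches the 1/20 level, which agrees with the discussion of the ratio 7/64.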


The procedure in this simple experiment may now appear to be 
quite clear and apparently straightforward in every respect. The 
reader will then be surprised to learn that we have been guilty of a very 
serious omission. We have said that, if the operator has actually no 
power of differentiation, the 64 ways of arranging the pairs are all 
equally likely to occur. Suppose now that the samples are presented to 
the operator in pairs with variety A to his left hand and variety B to 
his right hand. On the off chance that there may be such a systematic 
arrangement of the pairs, the operator decides to guess this order and 
then adhere to it throughout the experiment. The result is that the 
most probable arrangements are 6 right, or 6 wrong, and our theory as 
to the probable frequency of the different possible results is completely
broken down. Another possibility that we have omitted to consider
so far is that the 2 samples may show differences as to weight or quality 
which are actually quite independent of the variety characteristics. 
Here again the operator may, by guessing, obtain a result that is either 
all wrong or all right. We could go on and point out a number of factors 
that would tend to upset our calculations, and in the end the reader 
might despair as to the possibility of carrying through any experiment 
that would lead to valid conclusions. Why not take into consideration 
such factors as we have mentioned and work out the theoretical fre- 
quencies of the different combinations accordingly? A little thought 
will show that this is quite impossible. The vagaries of the minds of 
operators, for example, in taking advantage of certain orderly arrange- 
ments of the pairs, would be quite beyond the possibility of definite 
enumeration. The situation is not hopeless, however, as there is 
always at hand an extremely powerful method of overcoming this 
difficulty. The method is to arrange all factors that may enter into 
the results, completely at random. Thus, in presenting the pairs to the 
operator, a random arrangement would be followed that would be 
determined beforehand by throwing coins, drawing cards, or from a 
book of random numbers. It could then be stated with absolute confidence
that, on the hypothesis that the operator has no knowledge of
differentiating the samples, all possible arrangements would be equally 
likely to occur. It would be possible, for example, to use different 
colors of trays as containers for the samples. In each pair 1 tray might 
be red and 1 blue, and, if the varieties are assigned to the trays at 
random, it will still be true that all possible arrangements are equally 
likely. Of course a word of caution is needed here. Different colored 
trays, or any other disturbing influence on the ability of the operator to 
differentiate the samples, are not recommended, as they tend to reduce 
the efficiency of the test; but at the same time if such factors are properly
randomized they do not affect the validity of the test of significance. 
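
The effect of randomization can be illustrated with a short sketch. It assumes, purely for illustration, that each pair is laid out by a fair coin deciding which variety goes on the left, and that an operator with no power of differentiation guesses "A on the left" every time. The function names and the fixed guess are hypothetical and not part of the original experiment.

```python
import random
from collections import Counter
from itertools import product

N_PAIRS = 6

def random_arrangement(n=N_PAIRS, rng=random):
    """For each pair, decide by a fair 'coin' which variety goes on the left."""
    return [rng.choice("AB") for _ in range(n)]

def count_right(guesses, arrangement):
    """Number of pairs for which the guessed left-hand variety is the true one."""
    return sum(g == a for g, a in zip(guesses, arrangement))

# An operator with no power of differentiation who always guesses "A on the left".
fixed_guess = ["A"] * N_PAIRS

# Enumerate all 2**6 equally likely arrangements and tally the scores obtained.
tally = Counter(
    count_right(fixed_guess, arrangement)
    for arrangement in product("AB", repeat=N_PAIRS)
)

print(random_arrangement())   # one arrangement drawn for an actual trial
print(sorted(tally.items()))
```

The tally reproduces the combination counts 1, 6, 15, 20, 15, 6, 1, so a fixed guessing habit gains nothing once the order is randomized. Under a systematic layout with A always on the left, the same operator would score 6 right every time.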

3. Defining Some Statistical Terms. In describing our simple 
experiment, statistical terms were avoided as much as possible. Such 
terms are, however, a kind of shorthand and will be found very convenient 
as we proceed to the consideration of more intricate problems. The 
6 pairs of samples of grain constitute in themselves a sample in the true 
statistical sense. We were not particularly interested in what the 
operator did with the 6 pairs except in so far as it indicated his ability 
to differentiate the varieties in general. In other words, we were trying 
to obtain an estimate of what would happen if he were presented with a 
very large group of such pairs. This large group containing an indefinite 
number of pairs might be said to constitute the population that we are
sampling. The general problem of statistics, therefore, is the estimation
of values for populations by means of determinations made on samples 
drawn at random from these populations. Assuming that the final 
result of our experiment was 5 pairs of samples placed correctly, the best 
estimate we would have for what our operator might do with a very large 
sample is that he would place 5/6 of the pairs correctly. This value is the
mean number of successful placings that the operator would make in a 
population of similar pairs. A value such as this, calculated from a 
sample, is said to be a statistic. The population value of which the
statistic is an estimate is referred to as a parameter. Statistics are sub- 
ject to variability in that we will get different results with different 
samples. The populations sampled are regarded for convenience as 
being infinite; and therefore for any one variable, such as the number of
successful placings, there is only 1 value of the parameter. 

In all experiments there is a hypothesis to be tested. It will have
been noted that in the description of the simple experiment we repeatedly 
used the words “if the operator has no power of differentiation.” This 
points to the fact that the hypothesis we were testing was just that. In 
statistical parlance our hypothesis is now, owing to the pertinent suggestion
of Professor Fisher (1), referred to as the null hypothesis. This
null hypothesis was the basis for the calculation of the number of ways 
out of the total that certain results would be obtained, it being assumed, 
owing to randomization of the experiment, that all the possible ways 
were equally likely. 

4. Summary of Principles. We have now worked through an actual 
experiment, which, although it was extremely simple, has introduced us 
to the main principles of the statistical method and has allowed us to 
obtain an easy introduction to many of the common statistical terms. 
It will be convenient after this discussion to return to some of the gener- 
alizations of Section 1. 

It will have been noted that the logic employed in tests of significance 
is clearly that of the experimentalist. This is true whether or not the 
experimenter has any knowledge of mathematics. Always, if he is 
critical of his results, he asks himself whether or not they could have 
arisen as a chance variation, and on this basis arrives at some conclusion 
as to their significance. The statistical method, therefore, does not 
introduce anything new in this sense, but merely supplies him with the 
technique for planning his experiment so that it is justifiable to ask such 
a question, and then furnishes him with a method of measuring the 
confidence to be placed in the findings. 

The results from one sample are not used to obtain a statement as 
to the probability of obtaining a given result in drawing another sample,
but they are used to obtain an estimate of the population from which the
sample was drawn. 

A test of significance is, essentially, the use of the data provided by
the sample to test any hypothesis that may be set up. In such tests we 
do not always realize that a hypothesis is involved, but nevertheless this 
is true. When we ask the question, “Is my result due to some real 
effect or to a chance variation?” we can answer this question only by 
setting up the hypothesis that there is no effect, and determining whether
or not the results agree or disagree with the hypothesis. 

The mathematical derivations involved in statistical tests arise 
from attempts to state the proportion of cases, according to a given
hypothesis, in which the results obtained will occur. Thus, in the
experiment described above, the hypothesis was that the operator had
no power of differentiating the varieties; and on this basis we inquired as
to the proportion of cases in which a result of 6 right would occur. The 
order in which the samples were presented having been randomized, it 
was possible to state that all placings were equally likely; and hence we 
were able to derive by strictly mathematical methods the proportion of 
cases in which a given placing would occur. 

5. The Functions of Statistical Analysis. The chief functions of
statistical analysis as applied to experimental procedure may now be 
enumerated as follows: 

(a) To provide a sound basis for the formulation of experimental 
designs. 

(b) To provide methods for making tests of significance and 
trustworthy estimations of the magnitude of the effects indicated 
by the results.

(c) To provide adequate methods for the reduction of data. 

The discussion of the previous sections will have given a reasonably 
clear picture of the manner in which the principles of statistics are made 
use of in designing experiments. Since this is the most recent develop- 
ment in this field, it is natural that it is with respect to experimental 
design that the beginner is most likely to err. Frequently an elementary 
knowledge of statistics, consisting merely of an outline of the facts of 
variability and the various methods of measuring this variability, is 
taken as a sufficient knowledge for applying statistical methods to 
experimental work. The results of this practice are often disastrous. 
It is the reason why the consulting statistician is frequently presented 
with a set of data collected from an experiment which has been very 
badly designed. At the best, in such an experiment, there will be a loss
of precision and information; but in addition there may be a decided
bias in the results and as a consequence the whole or at least a part of the 
data may have to be discarded. It is not exaggeration, therefore, to 
state that to the experimentalist a study of statistical methods is futile 
unless he endeavors to apply these methods not only to the analysis of 
data but also to the structure of proposed experiments. 

The necessity for tests of significance has already been dealt with, 
but very little emphasis in the above discussion was placed on methods of 
estimation. It was pointed out, however, in the hypothetical example, 
that, if the operator's result was 5 right placings out of a possible 6, this
would have to be taken as the best estimate available of the proportion of 
correct placings the operator could be expected to make if presented with 
a large series of samples. Obviously the experiment was so small that 
this may not be very close to the proportion that the operator would
actually accomplish, and hence in this respect the experiment was not 
sufficiently extensive. The methods of statistics are concerned very 
vitally, therefore, with methods of estimation; and here again we cannot 
avoid noting the importance of experimental design, in that by careful 
design we can very largely determine beforehand the accuracy with which 
a particular estimate can be made. 

The necessity for the reduction of data is perfectly obvious, but it 
may not be clear as to the various methods employed in statistics for 
bringing this about. It is impossible to list these here, but we can 
classify them into three general groups: viz., tables, graphs, and statistics. 
The tables are usually prepared first, and from these we draw graphs to
illustrate the main features of the data, and calculate statistics. The 
statistics are single expressions such as the mean or average which 
express the general characteristics of the samples studied.


THE ARITHMETIC MEAN AND STANDARD DEVIATION-
FREQUENCY TABLES AND THEIR PREPARATION


1. The Arithmetic Mean. This is our first example of a statistic. It 
is called a statistic because we regard it in statistical practice as a value 
calculated from a sample, and an estimate of the mean of the population 
from which the sample was drawn. Values for the means of samples will 
be expected to vary from sample to sample, and are therefore not essen- 
tially different from individual variates in that respect. It is for this 
reason that it is not consistent terminology to speak of the mean or any 
other statistic calculated from a sample as a constant. The only constant
values in statistical theory and practice are the values representing
the infinite populations from which the samples are drawn. These, as 
we shall see later, are usually referred to in modern statistical litera- 
ture as parameters. 

It is often said of the arithmetic mean that it is the best single value 
that can be applied to the sample as a whole. Thus we find that the
agronomist refers to the average yield of a variety, and not to the individual
yields of a series of plots. Many other instances of this kind could
be cited; in fact, it is an everyday usage and needs no further explana- 
tion. 

For a sample of N variates, where $x_i$ represents any one variate, the
mean $\bar{x}$ is given by:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_i + \cdots + x_N}{N}$$

which for the sake of abbreviation is written:

$$\bar{x} = \frac{\Sigma(x)}{N} \tag{1}$$

If the values for three variates are 6, 8, and 1, the mean is obviously:

$$\bar{x} = \frac{6 + 8 + 1}{3} = \frac{15}{3} = 5$$

Using the short formula means simply that the summation of the three
quantities is understood, and, instead of writing out all the values and
connecting them with plus signs, we merely write 15/3 = 5. According
to strict mathematical usage, $\Sigma(x)$ should be written $\sum_{1}^{N}(x)$, to show that
N values are summated, but the simpler form may be used when the
number of summations is obvious.

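One of the most interesting properties of the mean is that the sum of
the deviations of all the individual variates from the mean is zero.
Again representing an individual variate by $x_i$, an individual deviation
from the mean will be $(x_i - \bar{x})$. Then summing all these we get:

$$
\begin{aligned}
\Sigma(x - \bar{x}) &= (x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_N - \bar{x}) \\
&= (x_1 + x_2 + \cdots + x_N) - N\bar{x} \\
&= (x_1 + x_2 + \cdots + x_N) - \frac{N(x_1 + x_2 + \cdots + x_N)}{N} \\
\Sigma(x - \bar{x}) &= 0
\end{aligned}
$$

Using the summation sign to shorten the algebra we would have

$$\Sigma(x - \bar{x}) = \Sigma(x) - \Sigma(\bar{x}) = \Sigma(x) - N\bar{x}$$

And since

$$N\bar{x} = \frac{N\,\Sigma(x)}{N} = \Sigma(x)$$

It is again clear that

$$\Sigma(x - \bar{x}) = 0$$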
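
The mean and the zero-sum property of the deviations can be verified on the miniature example. The following is a small sketch, assuming integer variates so that exact fractions can be used; the function name `mean` is chosen here for illustration.

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean: sum of the variates divided by their number N."""
    return Fraction(sum(values), len(values))

sample = [6, 8, 1]          # the three variates used in the text
x_bar = mean(sample)

# Individual deviations from the mean.
deviations = [x - x_bar for x in sample]

print(x_bar)             # 5
print(sum(deviations))   # 0
```

Exact fractions are used so that the sum of the deviations comes out as exactly zero. With decimal arithmetic a very small rounding residue may remain when the mean is not a whole number.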

2. The Standard Deviation. In using the mean of a sample to
represent the sample as a whole, it must occur to us that the reliability
of this method will depend on the degree of variation among the indi- 
vidual variates that make up the sample. If there is no variation the 
mean would represent the whole set perfectly; but as the variation 
becomes greater the single value of the mean is less and less descriptive 
of the entire group, and it becomes more and more necessary in order to 
describe the sample completely that we have some measure of variability. 
The average deviation from the mean might suggest itself, but we have 
seen that the sum of the deviations from the mean is zero, and from this 
it follows that the mean deviation is also zero. For this reason the sta- 
tistic that has been adopted as a measure of variability is the root mean 
square deviation, commonly known as the standard deviation. The
formula for the standard deviation, which is usually represented by the
Greek letter sigma ($\sigma$), is:

$$\sigma = \sqrt{\frac{\Sigma(x - \bar{x})^2}{N}} \tag{2}$$

The direct method of calculating the standard deviation is to find all the 
deviations from the mean, square them, summate, divide by N, and 
then extract the square root. For example, if we have the three figures 
6, 8, and 1, for which the mean is 5, the standard deviation would be:

$$\sigma = \sqrt{\frac{1^2 + 3^2 + 4^2}{3}} = \sqrt{\frac{26}{3}} = 2.94$$



When there are more variates in the sample, and especially when the 
deviations contain decimal figures, a much shorter method can be used. 
The main part of the work is to find the sum of squares of the deviations, 
and it can be shown very easily that:

$$\Sigma(x - \bar{x})^2 = \Sigma(x^2) - \frac{[\Sigma(x)]^2}{N} \tag{3}$$
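
For the reader who wishes to see the steps, the identity in formula (3) may be sketched as follows. This is a standard expansion supplied here for illustration, using only the definition $\bar{x} = \Sigma(x)/N$ and the fact that $\bar{x}$ is the same for every term of the sum.

```latex
\begin{aligned}
\Sigma(x - \bar{x})^2
  &= \Sigma(x^2 - 2x\bar{x} + \bar{x}^2) \\
  &= \Sigma(x^2) - 2\bar{x}\,\Sigma(x) + N\bar{x}^2 \\
  &= \Sigma(x^2) - \frac{2[\Sigma(x)]^2}{N} + \frac{[\Sigma(x)]^2}{N} \\
  &= \Sigma(x^2) - \frac{[\Sigma(x)]^2}{N}
\end{aligned}
```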


Applying this to our miniature example we have: 

$$\Sigma(x - \bar{x})^2 = (6^2 + 8^2 + 1^2) - 15^2/3 = 26$$

This formula is especially useful for machine calculation and is now used 
almost exclusively in statistical laboratories. 
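
The direct method and the shorter formula can be compared on the miniature example. The sketch below is an illustration only; the function names are chosen here and do not come from the original. It uses the divisor N, as in formula (2).

```python
from math import isclose, sqrt

def sum_of_squares_direct(values):
    """Sum of squared deviations from the mean, computed term by term."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values)

def sum_of_squares_short(values):
    """Same quantity from formula (3): sum(x^2) - (sum(x))^2 / N."""
    n = len(values)
    return sum(x * x for x in values) - sum(values) ** 2 / n

def standard_deviation(values):
    """Root mean square deviation, formula (2), with divisor N."""
    return sqrt(sum_of_squares_short(values) / len(values))

sample = [6, 8, 1]
print(sum_of_squares_direct(sample))          # 26.0
print(sum_of_squares_short(sample))           # 26.0
print(round(standard_deviation(sample), 2))   # 2.94
```

Both routes give 26 for the sum of squares. In floating-point arithmetic the shorter formula can lose accuracy when the variates are large compared with their spread, so the direct method serves as a useful check in such cases.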

We now have to consider a point which is very important in the practical
application of statistical methods, and one over which there is often
a great deal of confusion. It was pointed out above that the mean of a 
sample is taken as the best possible estimate of the mean of the parent 
population. This practice of estimating values for parent populations is 
the main object of calculating values for samples. With a little thought 
this point should be quite clear. We determine the reaction of a crop to 
a given fertilizer on a sample of plots which may not be more than 6 to 10 
in number. It cannot be stated, even by the wildest stretch of the 
imagination, that we are primarily interested in the reaction to the 
fertilizer on those 6 to 10 plots. What we are attempting to find out is 
the general reaction to the fertilizer under farming practice, and hence
we must picture a very large population of plots for the mean reaction of
which we are trying to obtain an estimate. If we let this population, for
purposes of clarity of thinking, be regarded as infinite, it follows that the
mean and the standard deviation for this population are fixed values and
hence we call them parameters. If the mean of the parent population
is denoted by m, then $\bar{x}$, the mean of the sample, is an estimate of the
parameter m. Similarly if $\sigma$ is the standard deviation of the parent
population, the value which we calculate from the sample must also be
the best possible estimate of $\sigma$. Actually this estimate is not the root
mean square deviation that we have defined above. This arises from
the fact that, if m is the mean of the parent population, the best estimate
of $\sigma$ is:

$$\sqrt{\frac{\Sigma(x - m)^2}{N}}$$

but since we do not know m we use $\bar{x}$ instead, and it can be shown by a
simple algebraic derivation that the best estimate of $\sigma$ is given by:

$$s = \sqrt{\frac{\Sigma(x - \bar{x})^2}{N - 1}} \qquad (4)$$

wherein we put this expression equal to s in that it is not $\sigma$ but the best
possible estimate of $\sigma$. We keep to this symbolism throughout in order
to distinguish the standard deviation calculated from a sample from the
true value which is a parameter of the parent population. The divisor
(N - 1) is known as the number of degrees of freedom available for
estimating the standard deviation. We shall learn more of this term in
later chapters.
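
The difference between the two divisors is easily seen on the three figures 6, 8, and 1 used earlier. The sketch below assumes Python's standard `statistics` module, in which `pstdev` divides by N and `stdev` divides by N - 1; the variable names are illustrative only.

```python
import statistics

values = [6, 8, 1]

# Root mean square deviation: divisor N, as in formula (2).
rms_deviation = statistics.pstdev(values)

# Estimate of the population standard deviation: divisor N - 1, as in formula (4).
s = statistics.stdev(values)

print(round(rms_deviation, 3), round(s, 3))
```

The estimate s is always the larger of the two, and the difference becomes negligible as N increases.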

3. Standard Deviation of a Sample Mean. If we take a series of
samples and determine a mean for each one, it is obvious that the means
for these samples will vary from sample to sample, and that the degree
of variation among these means will be related to the degree of variation
among the individual variates. If one particular sample is taken, the
exact relation is given by the equation:

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} \qquad (5)$$

where $s_{\bar{x}}$ is the standard deviation of the mean of the sample, s is the
standard deviation for the sample as a whole, and N is the number in the
sample. The standard deviation of a mean is therefore inversely
proportional to the square root of the number in the sample.
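
As an illustration of formula (5), the following sketch computes the standard deviation of the mean for the three figures 6, 8, and 1. The function name `standard_deviation_of_mean` is hypothetical and is introduced here only for the example.

```python
import math
import statistics

def standard_deviation_of_mean(values):
    """Formula (5): s divided by the square root of N, with s from formula (4)."""
    s = statistics.stdev(values)          # divisor N - 1
    return s / math.sqrt(len(values))

se = standard_deviation_of_mean([6, 8, 1])
print(round(se, 3))
```

Because of the square root, a sample four times as large is needed to halve the standard deviation of the mean.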

4. The Frequency Table. This is a table which shows, for the
sample of variates studied, the frequencies with which they fall into
certain clearly defined classes. If the sample is very small the frequency
table may not be necessary, and even if prepared may not mean very
much; but for moderately large samples it is usually desirable to begin
the reduction of the data with a table of this kind. The frequency table
provides the values for easy graphical representation, and from it such
statistics as the mean and standard deviation may be calculated with
much greater ease than from the original set of individual values.

5. Selection of Class Values. Frequency tables may deal with
either continuous or discontinuous variables. A continuous variable is
one in which a single variate may take any value within the range of
variation. Thus the yield of a plot of wheat may take any value within
the range from the lowest-yielding plot to the highest. A discontinuous
variable can take only certain specified values. For example, in tossing 
5 coins we can have 5, 4, 3, 2, 1, or 0 heads, and no other values can occur. 

A frequency table for the number of heads in tossing 5 coins 100 times 
might be as follows: 


Class Values     Frequency
5 heads               3
4 heads              16
3 heads              28
2 heads              31
1 head               17
0 heads               5

Total               100


The class values to be selected for such a table are obvious, and this is 
usually true for discontinuous variables. In some examples, however,
it may be necessary to form the class values such that the class interval
is greater than unity. In tossing coins 20 at a time, we might use the
classes 0-2 heads, 3-5 heads, and so forth.

If the variable is continuous, the classes for which the frequencies are 
to be determined must be chosen arbitrarily, the choice depending on the 
accuracy required in the computation of statistics from the table, the 
range of variation (which is, of course, the difference between the lowest
and the highest value of the sample), the number in the sample or total
frequency, and the facility with which these classes can be handled in 
computation. In the first place, the greater the number of classes the 
greater the accuracy of the calculations made from the table. But there 
must be a limit to the number of classes we can handle conveniently, 
and these two opposing factors must be balanced up. A good general 
rule is to make the class interval not more than one-quarter of the stand- 
ard deviation. Of course we do not as a rule know what the standard 
deviation is before the table is made up, but it is possible to make a
rough estimate of its value from the range of variation. Tippett (3)
has published detailed tables on the relation between the range of variation
and the standard deviation, and these have been summarized in a
short table prepared by Snedecor (2). The following values are taken
from Snedecor’s table after rounding off the figures to two significant
digits.

TABLE 2

Values of the Ratio, Range Divided by the Standard Deviation (SD),
for Sample Sizes from 20 to 1000

Number in Sample   Range/SD     Number in Sample   Range/SD
       20             3.7             200             5.5
       30             4.1             300             5.8
       50             4.5             400             5.9
       75             4.8             500             6.1
      100             5.0             700             6.3
      150             5.3            1000             6.5


Now suppose that we have a sample of 500 variates and the range of 
variation is 0.25 to 2.63. The difference is 2.38, and if we were to 
divide this by the standard deviation our table tells us that we would get 
a quotient of approximately 6.1. In order to make the class interval
about one-quarter of the standard deviation, it is clear that its magnitude
will have to be about 2.38/(6.1 × 4) = 0.098. It is more convenient to
have an odd number for a class interval than an even one, since it means
that the midpoint of the interval does not require one more decimal
place than we have in the values that define the class range. In the end
we should probably decide in this case on an interval of 0.11. In making
up the classes it is usual to begin with the lower boundary of the first 
class slightly below the lowest value, so that our classes and midpoints 
would finally be set up somewhat as follows: 


Class Range        Class Value, or Midpoint
                   of Class Range
0.19 to 0.29               0.24
0.30 to 0.40               0.35
0.41 to 0.51               0.46
0.52 to 0.62               0.57
etc.                       etc.
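
The arithmetic of this rule may be sketched as follows. The function names `suggested_interval` and `class_table` are hypothetical, and the recording unit of 0.01 is an assumption made to match the two-decimal values of the example.

```python
def suggested_interval(low, high, range_over_sd, fraction=0.25):
    """Class interval as a fraction of the SD estimated from the range."""
    sd_estimate = (high - low) / range_over_sd
    return fraction * sd_estimate

def class_table(first_lower, interval, count, unit=0.01):
    """Return (lower, upper, midpoint) for each class; unit is the recording precision."""
    rows = []
    for k in range(count):
        lower = first_lower + k * interval
        upper = lower + interval - unit
        rows.append((round(lower, 2), round(upper, 2), round((lower + upper) / 2, 2)))
    return rows

# Sample of 500 variates ranging from 0.25 to 2.63; Range/SD taken as 6.1.
raw = suggested_interval(0.25, 2.63, 6.1)
print(round(raw, 3))                      # the text rounds this up to an odd interval, 0.11

rows = class_table(0.19, 0.11, 4)
for lower, upper, midpoint in rows:
    print(f"{lower:.2f} to {upper:.2f}   {midpoint:.2f}")
```

With an odd interval such as 0.11 the midpoints fall on two-decimal values, which is the convenience referred to above.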


By following the above rules we ensure a sufficient degree of accuracy
in any statistics that are calculated from the frequency table; but, if
the frequency table is required mainly for the preparation of a graph as
described below, this method may give classes that are too small, in that
some of the classes may contain only very small frequencies or perhaps
none at all. It is desirable in such cases to make the class interval from
one-half to one-third of the standard deviation.

In statistical literature one may come across references to Sheppard’s
corrections for grouping. These are designed to remove bias from
certain statistics that are calculated from grouped data instead of from
the individual values. Thus, in calculating $\Sigma(x - \bar{x})^2/(N - 1)$, it has
been shown that the bias is positive and equal approximately to 1/12 of
the square of the class interval. In the tests for abnormality described in Chapter
III, and in certain other specific calculations, it is necessary to make the 
adjustments, but in general practice they are usually ignored and in 
many tests of significance it is more correct to omit them altogether. 

The student should note carefully at this point that Sheppard’s cor- 
rections are for the purpose of removing a definite bias and in no sense do 
they make allowance for inaccuracies introduced by using groups that 
are too large. 

6. Sorting out the Variates and Formation of the Frequency Table. 
Sorting is greatly facilitated by writing the value of each variate on cards 
of a convenient size for handling. The class ranges are first written out 
on cards and arranged in order on a table. The sorting can then be done 
rapidly, and after it is finished it is very easy to run through the piles 
and obtain a complete check on the work. It is very important to have 
perfect accuracy at this point. In a series of studies a misplaced card 
may give a great deal of trouble at a later stage in the work. The fre- 
quency table is finally made up by entering the frequencies opposite the 
corresponding class values. 

Table 3 is a sample of a frequency table. It represents data on the
carotene content of the whole wheat of 139 varieties. The class values 
are in parts per million of carotene in the whole wheat. In this instance 
a great deal of accuracy in the calculations was not desired, and it will 
be noted that the class values are larger than they would be if the rules 
for the formation of these values as outlined above had been followed. 
Check this point by reference to Table 2. 


TABLE 3

Frequency Table for Parts per Million of Carotene in the
Whole Wheat of 139 Varieties of Wheat

Class Values      Frequency
0.85 to 0.95           2
0.96 to 1.06           6
1.07 to 1.17          14
1.18 to 1.28          21
1.29 to 1.39          24
1.40 to 1.50          37
1.51 to 1.61          13
1.62 to 1.72          10
1.73 to 1.83           4
1.84 to 1.94           3
1.95 to 2.05           2
2.06 to 2.16           2
2.17 to 2.27           1

7. Graphical Representation of a Frequency Table. Graphs of two
types are in general use. The best type of graph and the one most 
commonly used is the histogram. It is a diagrammatic representation of 
a frequency table in which the class values are represented on the horizontal
axis, and the frequencies by vertical columns erected in their
appropriate positions on the horizontal axis. The histogram is most
useful when a curve for some theoretical distribution is being fitted. The
nature of any disagreement between the theoretical distribution and the
actual frequencies can be located readily when the theoretical curve is
superimposed on the histogram. As an example the histogram for the
data of Table 3 is shown in Fig. 1.

Fig. 1. Histogram for the data of Table 3 (horizontal axis: carotene, parts per million).

The other type of graph is usually known as a frequency polygon. A 
straight line is erected for each frequency at the midpoint of the corre- 
sponding class value, and the ends of these connected in sequence by 
straight lines. It does not give as accurate a picture for the sample as
the histogram, but tends in its shape towards the smooth curve of the 
population from which the sample was drawn. 

8. Calculation of the Mean and Standard Deviation from a Frequency 
Table. After the frequency table has been formed, we add two more 
columns as indicated in the small example given below:

Class Value or      Frequency    Frequency Multiplied    Frequency Multiplied by
Midpoint of             f        by Class Value          Square of Class Value
Class Range (x)                  f × (x)                 f × (x²)

       1                2               2                        2
       2                4               8                       16
       3                7              21                       63
       4                6              24                       96
       5                1               5                       25

Totals               20 = N        60 = Σ(x)              202 = Σ(x²)

On summating the last three columns we get N, Σ(x), and Σ(x²), which
are the values necessary for the calculation of the mean and the standard
deviation. The mean is given by:

$$\bar{x} = \frac{\Sigma(x)}{N}$$

and the standard deviation by:

$$s = \sqrt{\frac{\Sigma(x^2) - [\Sigma(x)]^2/N}{N - 1}} \qquad (6)$$

It will be noted that the numerator of the standard deviation is
$\Sigma(x - \bar{x})^2$, and that to obtain it we have made use of the identity given
in formula (3).
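
The calculation for the small example may be sketched in Python as follows. The function name `mean_sd_from_frequencies` is illustrative only; the class values 1 to 5 and the frequencies 2, 4, 7, 6, 1 are those of the example.

```python
import math

def mean_sd_from_frequencies(class_values, frequencies):
    """Mean and standard deviation (formula 6) from a frequency table."""
    n = sum(frequencies)
    sum_x = sum(f * x for x, f in zip(class_values, frequencies))
    sum_x2 = sum(f * x * x for x, f in zip(class_values, frequencies))
    mean = sum_x / n
    s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
    return n, sum_x, sum_x2, mean, s

n, sum_x, sum_x2, mean, s = mean_sd_from_frequencies([1, 2, 3, 4, 5], [2, 4, 7, 6, 1])
print(n, sum_x, sum_x2, mean, round(s, 3))
```

The three totals returned first correspond to the totals of the last three columns of the table.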

The class values are very frequently numbers containing two to four
digits, in which case a great deal of labor can be saved by replacing them
by the series of natural numbers 1, 2, 3, 4, etc. By this method we
obtain a mean and a standard deviation that we shall designate by $\bar{x}'$
and $s'$, respectively. These can be converted into the true values by
means of the following identities:

$$\bar{x} = (\bar{x}' - 1)i + x_1 \qquad (7)$$

$$s = s'i \qquad (8)$$

where i is the class interval and $x_1$ is the first true class value.
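
Formulas (7) and (8) can be tried on the carotene data of Table 3. The sketch below assumes the thirteen frequencies of that table, a class interval of 0.11, and a first class value of 0.90 (the midpoint of the class 0.85 to 0.95); the variable names are illustrative.

```python
import math

frequencies = [2, 6, 14, 21, 24, 37, 13, 10, 4, 3, 2, 2, 1]   # Table 3
coded = range(1, len(frequencies) + 1)                         # 1, 2, ..., 13
interval = 0.11
first_class_value = 0.90

n = sum(frequencies)
sum_x = sum(f * x for x, f in zip(coded, frequencies))
sum_x2 = sum(f * x * x for x, f in zip(coded, frequencies))

mean_coded = sum_x / n
s_coded = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

mean_true = (mean_coded - 1) * interval + first_class_value    # formula (7)
s_true = s_coded * interval                                    # formula (8)

print(round(mean_coded, 3), round(mean_true, 3), round(s_coded, 3), round(s_true, 4))
```

The values obtained in this way can be compared with the answers given for Exercise 1 below.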

9. Coefficient of Variability. This is the term applied to the standard
deviation when it is expressed in percentage of the mean of the
sample. It is a statistic of very limited usage owing to the difficulty of
determining its reliability by statistical methods. The formula is
obviously:

$$C \text{ (coefficient of variability)} = \frac{s \times 100}{\bar{x}} \qquad (9)$$

10. Exercises.

1. Substitute the natural numbers 1, 2, 3, ..., 13 for the class values of Table 3,
and calculate the mean and the standard deviation. Convert the calculated values to
actual values using formulas (7) and (8).

$\bar{x}' = 5.597$   $\bar{x} = 1.406$   $s' = 2.196$   $s = 0.2416$

2. Table 4 gives the yields in grams of 400 square-yard plots of barley. Make
up a frequency table and histogram for these yields, using a class interval of 11, and
make the first class 14 to 24.

3. The areas in arbitrary units of 500 bull sperms are given in Table 5.¹ Prepare
the frequency table and histogram, using 16 classes, making the first class 123 to 125.

4. For either one of Exercises 2 and 3 above, calculate the mean and the standard
deviation from the frequency table, using actual class values. Then replace the
actual class values by 1, 2, 3, 4, ..., and recalculate the mean and the standard
deviation.

Ex. 2   $\bar{x}' = 13.055$   $\bar{x} = 151.60$   $s' = 2.880$   $s = 31.68$
Ex. 3   $\bar{x}' = 7.862$   $\bar{x} = 144.56$   $s' = 2.576$   $s = 7.728$

5. For the data in Tables 4 and 5, determine the class values that should be used
to give a high degree of accuracy in the calculations.

6. Prove the identity:

$$\Sigma(x - \bar{x})^2 = \Sigma(x^2) - [\Sigma(x)]^2/N$$

TABLE 4

Yields in Grams of 400 Square-Yard Plots of Barley


185 

162 


157 

141 

130 

129 

176 

171 

luo 

157 

147 

176 

126 

175 

134 

169 

180 

180 

128 

169 

205 

BTl 

117 

144 

125 

165 

170 

153 

186 

164 

123 

165 

203 

156 

182 

164 

176 

176 

150 

216 

154 

ng 

203 

166 

155 

215 

190 

164 

204 

194 

148 

162 

146 

174 

185 

171 

181 

158 

147 

165 

157 

180 

165 

127 

186 

133 

170 

134 

nn 

1^ 

160 

128 

152 

165 

139 

146 

144 

178 

188 

133 

128 

161 

160 

167 

156 

125 

162 

128 

103 

116 

87 

123 

143 

130 

119 

141 

174 

157 

168 

165 

180 

158 

130 

ISO 

168 

145 

166 

118 

171 

14J 

132 

126 

171 

176 

115 

165 

147 

186 

15/ 

187 

174 

172 

191 

155 

160 

130 

144 

[FTj 

146 

159 

164 

160 

122 

175 

156 

119 

IJ.} 

116 

134 

157 

182 

209 

136 

153 

160 

142 

179 

125 

149 

171 

186 

106 

175 

189 

214 

169 

16(' 

164 

105 

189 

m 

118 

140 

178 

171 

151 

192 

127 

148 

158 

174 

101 

134 

188 

248 

164 

206 

185 

102 

147 

178 

180 

141 

173 

187 

167 

128 

139 

152 

167 

131 

203 

2J1 

214 

177 

161 

104 

141 

161 

124 

130 

112 

122 

102 

155 

EE] 

170 

166 

156 

131 

170 

201 

122 

207 

180 

164 

131 

' 211 

172 

170 

140 

156 

100 

181 

181 

150 

184 

154 

200 

187 

160 


107 

143 

145 

100 

176 

162 

123 

180 

104 

146 

22 

160 

107 

70 

84 

112 

162 

124 

156 


101 

1.18 

141 

143 

135 

163 

183 

09 

118 

150 

151 

83 

136 

in 

101 

155 

164 

98 

136 


168 

130 

111 

136 

120 

122 

DU 

179 

172 

102 

171 

151 

142 1 

103 

174 

146 

180 

140 

137 

138 

104 

iTtrl 

120 

124 

126 

126 

147 

115 

148 

105 

154 

140 

130 

163 

118 

120 

127 

130 

174 

167 

176 

179 

172 

174 

167 

142 

160 

122 

163 

144 

147 

123 

160 

137 

161 

122 

101 

158 

103 

110 

J64 

112 

57 

94 

106 

132 

122 

164 

142 

155 

147 

115 

143 

68 

184 

183 

107 

160 

138 

191 

133 

160 

156 

122 

Ml 

153 

148 

103 

131 

180 

142 

101 

175 

146 

181 

111 

110 

154 

170 

168 

175 

175 

146 

148 

iiiyl 

106 

123 

121 

154 

148 

91 

93 

74 

113 

70 

131 

119 

96 

80 

07 

98 

106 

107 

60 

1 

86 

94 

120 


¹ Data by courtesy of A. Savage, Department of Animal Pathology, University
of Manitoba.

TABLE 5

Areas in Arbitrary Units of 500 Bull Sperms


139 143 

140 148 

153 153 

149 149 

158 157 

141 141 

144 144 

138 149 

144 144 

134 124 

146 146 

139 139 

149 149 

161 160 

142 141 

136 136 

146 146 

155 154 

143 142 

134 ^ 131 

140 139 

148 148 

149 149 

137 136 

139 139 

136 136 

153 153 

150 150 

133 133 

134 132 

147 147 

146 145 

135 137 

151 148 

145 144 

157 156 

160 150 

141 141 

143 143 

144 146 

146 146 

134 135 

161 151 

142 141 

135 133 

138 140 

158 152 


140 

140 

n 

139 

140 



139 

147 

Kn 

■tn 

155 

142 


141 

141 

147 

147 

149 

148 

153 

153 

155 

141 

149 

148 

147 

147 

157 

141 

141 

143 

141 

139 

138 

159 

145 

144 

144 

146 

149 

148 

148 

148 

144 

146 

146 

146 

124 

134 

132 

136 

139 

138 

138 

138 

152 

150 

160 

150 

149 

154 

154 

153 

159 

135 

154 

164 

141 

141 

141 

142 

135 

137 

136 

135 

146 

145 

140 

140 

153 

153 

153 

153 

142 

142 

147 

147 

130 

129 

131 

130 

139 

139 

127 

137 

147 

147 

147 

147 

149 

149 

149 

146 

136 

137 

137 

136 

152 

152 

152 

152 

137 

145 

144 

146 

155 

135 

158 

158 

150 

150 

151 

151 

134 

129 

130 

141 

127 

137 

128 

125 

169 

165 

162 

162 

145 

144 

146 

145 

137 

127 

134 

132 

148 

147 

147 

149 

144 

146 

144 

143 

137 

137 

137 

137 

162 

152 

162 

152 

143 

142 

142 

142 

144 

144 

144 

144 

145 

145 

145 

138 

146 

146 

146 

145 

167 

156 

156 

167 

151 

150 

150 

150 

142 

141 

141 

142 

133 

150 

151 

149 

140 

153 

153 

148 

141 

141 

142 

141 


138 138 138 138 

139 145 145 145 

160 159 159 160 

145 145 144 146 

148 148 149 149 

149 149 149 149 

148 159 161 161 

143 143 143 142 

161 155 137 136 

145 145 144 146 

162 162 153 153 

145 146 141 143 

137 125 123 134 

140 140 146 139 

152 151 149 149 

155 149 149 149 

154 155 154 154 

142 142 142 141 

135 137 137 137 

140 138 138 140 

153 153 153 155 

150 152 152 150 

129 129 134 134 

134 132 133 133 

147 149 148 147 

146 146 145 146 

134 136 135 129 

151 150 152 152 

146 145 145 145 

157 157 157 168 

150 150 152 151 

143 142 141 141 

136 141 143 143 

149 144 144 144 

144 146 116 146 

135 135 127 126 

149 149 150 151 

143 143 143 142 

136 137 135 133 

152 152 152 151 

138 140 140 140 

146 146 140 139 

140 139 138 153 

145 145 146 145 

157 157 157 156 

150 150 150 152 

142 142 141 143 

139 139 139 138 

147 147 156 168 

143 139 139 139

CHAPTER III


THEORETICAL FREQUENCY DISTRIBUTIONS

1. Characteristics of Frequency Distributions of Biological Variates.
A frequency table may be used to furnish an estimate of the frequency
distribution of the population from which the sample has been taken.
For example, we could take any one of the frequency tables of Chapter
II and draw a smooth curve through the upper ends of the columns of the
histogram. We would draw a smooth curve because the parent popula- 
tion is assumed to be infinite and each point on the base line could be 
represented by a frequency, or, to be more specific, the height of the
perpendicular line from any point on the base line to the curve would 
represent the proportion of the total frequency of the population having 
the value represented by the point. This method, however, would not 
be very satisfactory, as the position of the curve would be, to a considerable
extent, a matter of individual judgment. Also, the sample studied
might indicate, owing to errors of sampling, certain irregularities and 
lack of symmetry which might be entirely absent in the population. 
Furthermore, to be consistent in our logic, it follows that we are not so 
much interested in drawing a curve that fits the sample as we are in 
setting up a theoretical curve as a hypothesis and then determining 
whether or not the data of the sample agree with the theoretical fre- 
quencies. In setting up our theoretical curve, it is of course natural 
that we set up one that is likely to agree fairly well with the data of the 
sample, and this is only saying in other words that we should set up a 
reasonable hypothesis. We could set up a whole series of theoretical 
curves, the majority of which would have no resemblance whatever to 
the histogram of the sample; but obviously this would be a mere waste 
of time. To deduce a theoretical distribution into which our sample is 
likely to fit, it is necessary to study the characteristics of the frequency 
tables for biological variates as a whole and work out a logical theory for 
setting up the theoretical values. If we examine the histograms of 
Chapter II for three different kinds of biological variates, we find that 
they have certain characteristics in common. Close to the mean, the 
variates occur with much greater frequency than they do at some dis- 
tance from the mean; but the reduction in the frequencies from the mean 
to the extreme tails of the distribution is not uniform, with the result that
if a smooth curve is drawn through the tops of the columns of the histograms
it is seen to resemble an isosceles triangle but with a rounded top
and very much flattened base. A curve of this type is found to resemble 
very closely a definite type of mathematical curve; but to understand 
more easily the reasoning behind the derivation of this curve it is neces- 
sary for us to look into the characteristics of another theoretical dis- 
tribution that is appropriate for discontinuous variables. 

2. The Binomial Distribution. In Chapter I we derived a theoretical 
distribution for the experiment on identifying varieties of wheat. This 
will be found in Table 1. Each theoretical frequency was derived by the
direct application of elementary theorems of probability, and if, instead
of dealing with specific numbers of pairs of samples, we had dealt with
the problem as a general one for any number of pairs of samples we
would have derived the binomial distribution. Thus the theoretical
frequencies of Table 1 can be written out at once from the terms of the
expansion of the expression $(\tfrac{1}{2} + \tfrac{1}{2})^6$. These are:

$$\frac{1}{64},\ \frac{6}{64},\ \frac{15}{64},\ \frac{20}{64},\ \frac{15}{64},\ \frac{6}{64},\ \frac{1}{64}$$

wherein we note that the theoretical frequencies are stated as proportions
of the total number and express directly the probabilities of particular
combinations. In general for similar problems where there are
alternative possibilities such as right or wrong placings of pairs of
samples, heads or tails in the tossing of a coin, an ace or any other number
in the throwing of a die, etc., the theoretical distribution can be
written down directly by expanding the binomial $(p + q)^n$, where n is
the number of events in any 1 trial, p is the probability of the occurrence
of the event in 1 way, q is the probability of the occurrence of the
event in the alternative way, and p + q = 1. If p = q we obtain a
symmetrical distribution, but if p is not equal to q the distribution is
asymmetrical or skewed.
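
The terms of the expansion can be written out mechanically. The following sketch uses exact fractions so that the denominators are visible; the function name `binomial_terms` is hypothetical, and the second call (an ace with a die thrown 6 times) is added only to show a skewed case.

```python
from fractions import Fraction
from math import comb

def binomial_terms(n, p):
    """Successive terms of the expansion of (p + q)^n, with q = 1 - p."""
    q = 1 - p
    return [comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1)]

symmetrical = binomial_terms(6, Fraction(1, 2))   # p = q
skewed = binomial_terms(6, Fraction(1, 6))        # p not equal to q

print([int(term * 64) for term in symmetrical])   # numerators over 64
print(sum(symmetrical), sum(skewed))              # each set of terms totals 1
```

When p = q the terms read the same from either end; when p differs from q the large terms crowd toward one end of the series.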

There are many applications of the binomial distribution in statistical 
analysis, and one application of particular interest will be dealt with in
Chapter X. For the present it is sufficient to note that the form of the
distribution is somewhat similar to the actual distributions of Chapter
II, which we have concluded are fairly typical for biological variables in 
general. However, the binomial distribution is not suitable as a theoret- 
ical distribution for continuous variables, as in itself it is essentially 
discontinuous; so that if we make any use of it for continuous variables 
it must be as a stepping stone to some more general type of distribution. 
The biological variables we have studied indicated from the samples for 
which histograms were made that the parent populations were essentially
symmetrical. The comparable situation for the binomial distribution
would occur when p = q. Starting from this point, therefore,
let us suppose that n is infinitely large; and, in graphing the histogram 
for the theoretical distribution, the columns which will also be infinite 
in number are represented by vertical lines only. The result will be a 
smooth curve, and by carrying through this procedure algebraically and 
making certain approximations we can arrive at an equation for a 
smooth curve. This is the expression for what is commonly known as 
the normal frequency distribution. 

3. The Normal Distribution. Most variables dealt with in biological 
statistics show in their actual distributions only minor deviations from 
the theoretical normal distribution defined by:

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2} \qquad (1)$$

where $\sigma$ is the standard deviation of the population, N is the total number
of variates, e is the base of the Napierian system of logarithms, and
y is the frequency at any given point x, where x is measured from the
mean of the population. The curve expresses, therefore, the relation
between y and x, with y as the dependent variable. Figure 2 is a sketch
of a normal curve. It illustrates the measurement of x from the mean
of the population which is located at the point where the dotted line has
been erected. For the value of x taken, y is the perpendicular distance
from that point to the curve.

Fig. 2. Sketch of a normal curve, the base line measured in units equal to the
standard deviation ($\sigma$).

Equation (1) may be written:

$$y = \frac{N}{\sigma} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2} \qquad (2)$$

and putting z for $y(\sigma/N)$ we have:

$$z = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2} \qquad (3)$$

and since $x/\sigma$ varies in actual practice only from 0 to 6, the values of z
have been tabulated for all the values of $x/\sigma$ from 0 to 6 proceeding by
intervals of 0.01. Any given value of z can then be transformed to y by
multiplying by $N/\sigma$ for the particular population with which we are
dealing. In other words, for a given population for which N and $\sigma$ are
known, we can proceed with a set of tables to plot the theoretical smooth
curve.

A smooth curve plotted by the above method is an estimate of the 
form of the infinite population from which the sample has been drawn; 
but what we often require is the theoretical frequency distribution corre- 
sponding to the actual frequency distribution of the sample. That is, 
we require the theoretical normal frequencies for the arbitrarily chosen 
class values of the actual distribution. For this purpose, if N is taken
as 1, equation (1) becomes:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}$$

which can be integrated from x = minus infinity to x = any assigned
value. This gives the area under that portion of the curve, and we will
represent it as $\tfrac{1}{2}(1 + \alpha)$. The integration is started at x = minus
infinity, because the normal curve never actually touches the base line
although, at $x/\sigma = -6$, y is an exceedingly small value. The reason for
expressing the area as $\tfrac{1}{2}(1 + \alpha)$ or $\tfrac{1}{2} + \tfrac{1}{2}\alpha$ will be seen from an examination
of Fig. 3. For any assigned value of x the area within the
limits of $\pm x$ is represented by $\alpha$. Therefore, from x = minus infinity to
x = any assigned value, if the total area of the curve is 1, the area is
$\tfrac{1}{2} + \tfrac{1}{2}\alpha$.

The tabulated values of z and $\tfrac{1}{2}(1 + \alpha)$ for values of $x/\sigma$ from 0 to 6
are given in Sheppard’s “Tables of Area and Ordinate in Terms of
Abscissa.” These are commonly referred to as Sheppard’s tables of the
probability integral. The detailed application of these tables to a practical
example is described below under Section 4.
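
Where printed tables are not at hand, the two tabulated quantities can be computed directly. The sketch below is not a reproduction of Sheppard's tables; it assumes the error function `math.erf` of Python's standard library, for which $\alpha = \operatorname{erf}(t/\sqrt{2})$ when $t = x/\sigma$, and the function names are illustrative.

```python
import math

def ordinate(t):
    """z of equation (3) at t = x / sigma."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def area_to(t):
    """Area from minus infinity to t = x / sigma, that is (1 + alpha) / 2."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

for t in (0.0, 1.0, 2.0, 3.0):
    print(f"{t:3.1f}  z = {ordinate(t):.4f}  area = {area_to(t):.4f}")
```

For a negative value of $x/\sigma$ the ordinate is unchanged and the area is the complement of the area for the corresponding positive value.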

4. Methods of Calculation.

Fig. 3. Sketch of a normal curve showing ordinates erected at $x/\sigma = +1$ and
$x/\sigma = -1$. The unshaded area = $\alpha$, and the shaded area = $(1 - \alpha)$.

Example 1. The calculations necessary to fit a normal curve to an actual
frequency distribution and to determine the normal frequencies corresponding to
the actual frequencies are given in Table 6. The data are for the transparencies of
400 red blood cells taken from a patient suffering from primary anemia (4). The
transparency is taken as the ratio of the total light passing through the cell to the
area of the cell. For this distribution $\bar{x}' = 7.06$ and $\sigma = 2.45$.

The calculations can best be described by considering each column of the table.
The columns have been numbered at the head of the table for convenient reference.

Column (1): The class ranges are as described in Chapter II. Note that
unit class intervals have been used. This is necessary in obtaining but makes
no difference to the remainder of the calculations. After setting up the class
ranges, the actual frequencies may be entered as in column (10), but it is of no 
consequence when these are entered as they are not used in the calculations. 

Column (2): In order to understand clearly the meaning of the class limits,
refer to any histogram as in Chapter II, Fig. 1, or Exercises 2 and 3. The limits
correspond with the lines bordering the columns of the histogram. The mean of
the sample is placed according to the class range in which it falls. In this case
the mean is 7.06 and must be placed opposite the class range 6.6-7.5. The limits
are then entered by passing in both directions from the mean. The class in
which the mean falls will have two limits, but for each of the others we take only
the one farthest from the mean.

TABLE 6

Calculation of Ordinates for Fitting a Normal Curve, and
Theoretical Normal Frequencies

   (1)        (2)      (3)     (4)     (5)       (6)       (7)         (8)          (9)          (10)
  Class      Class      d      d/σ      z      z × N/σ   ½(1 + α)   N × ½(1 + α)  Theoretical    Actual
  Range      Limits                                                                 Normal      Frequencies
                                                                                  Frequencies

                       9.56    3.90   0.0002     0.03     1.0000      400.00
                       8.56    3.49   0.0009     0.15     0.9998      399.92         0.08
                       7.56    3.08   0.0035     0.57     0.9990      399.60         0.32
                       6.56    2.68   0.0110     1.80     0.9963      398.52         1.08
 0.6- 1.5     1.5      5.56    2.27   0.0303     4.95     0.9884      395.36         3.16           4
 1.6- 2.5     2.5      4.56    1.86   0.0707    11.54     0.9686      387.44         7.92          11
 2.6- 3.5     3.5      3.56    1.45   0.1394    22.76     0.9265      370.60        16.84          17
 3.6- 4.5     4.5      2.56    1.04   0.2323    37.92     0.8508      340.32        30.28          29
 4.6- 5.5     5.5      1.56    0.64   0.3251    53.08     0.7389      295.56        44.76          43
 5.6- 6.5     6.5      0.56    0.23   0.3885    63.43     0.5910      236.40        59.16          56
 6.6- 7.5     7.06     0.00    0.00   0.3989    65.12     0.5000      200.00        64.96          68
 7.6- 8.5     7.5      0.44    0.18   0.3925    64.08     0.5714      228.56        60.40          53
 8.6- 9.5     8.5      1.44    0.59   0.3352    54.72     0.7224      288.96        47.56          61
 9.6-10.5     9.5      2.44    1.00   0.2420    39.51     0.8413      336.52        31.16          25
10.6-11.5    10.5      3.44    1.40   0.1497    24.44     0.9192      367.68        18.24          20
11.6-12.5    11.5      4.44    1.81   0.0775    12.65     0.9648      385.92         8.80           9
12.6-13.5    12.5      5.44    2.22   0.0339     5.53     0.9868      394.72         3.56           4
13.6-14.5    13.5      6.44    2.63   0.0126     2.06     0.9957      398.28         1.24
                       7.44    3.04   0.0039     0.64     0.9988      399.52         0.36
                       8.44    3.44   0.0011     0.18     0.9997      399.88         0.08
                       9.44    3.85   0.0002     0.03     0.9999      399.96         0.04
                      10.44    4.26   0.0000     0.00     1.0000      400.00

Total                                                                               400           400


Column (3): The deviation of the class limit from the mean. Note that this
corresponds to $x$ in the discussion above.

Column (4): Figures in the previous column divided by the standard deviation.
The latter is calculated using unit class intervals, and from the formula

$$\sigma = \sqrt{\frac{\Sigma(x - \bar{x})^2}{N}}$$

Column (5): Values of $z$ from Sheppard's "Tables."

Column (6): Corresponding $z$ values multiplied by $N/\sigma$.

Column (7): Values of $\tfrac{1}{2}(1 + \alpha)$ from Sheppard's "Tables."

Column (8): Corresponding $\tfrac{1}{2}(1 + \alpha)$ values multiplied by $N$.

Column (9): Differences between consecutive values in column (8). Begin
at 400 at each end and go towards the center. At the center the two differences
are added. Note that the theoretical frequencies are not kept in line with the
values in column (8), but are lined up with the corresponding actual frequencies
in column (10).

Column (10): The actual frequencies.
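
The column-by-column procedure above can be sketched in a few lines of Python. This is an illustration only, not the tabular method itself: the mean 7.06, N = 400, and the unit class limits are taken from Table 6, the value σ = 2.451 is an assumption inferred from the d/σ column, and the standard library's `NormalDist` stands in for Sheppard's "Tables." The function names are hypothetical.

```python
from statistics import NormalDist

MEAN, SIGMA, N = 7.06, 2.451, 400   # SIGMA is an assumed value, not given in the text
std = NormalDist()                  # unit normal curve, in place of Sheppard's Tables

def ordinate(limit):
    """Columns (3) to (6): y = z * N / sigma at a class limit."""
    ratio = abs(limit - MEAN) / SIGMA          # column (4), d / sigma
    return std.pdf(ratio) * N / SIGMA          # column (6)

def theoretical_frequency(lower, upper):
    """Column (9): N times the area of the curve between two class limits."""
    area = std.cdf((upper - MEAN) / SIGMA) - std.cdf((lower - MEAN) / SIGMA)
    return N * area

# Class limits -2.5, -1.5, ..., 17.5, the same span as the rows of Table 6.
limits = [x + 0.5 for x in range(-3, 18)]
freqs = [theoretical_frequency(a, b) for a, b in zip(limits, limits[1:])]

print("ordinate at the mean:", round(ordinate(MEAN), 2))
print("class 6.6-7.5:", round(theoretical_frequency(6.5, 7.5), 2))
print("total of theoretical frequencies:", round(sum(freqs), 2))
```

Because the areas are computed directly rather than from d/σ rounded to two decimals, individual frequencies may differ from the table in the first decimal place; the total should still come out very close to N.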


























THEORETICAL FREQUENCY DISTRIBUTIONS 


5. Probability Calculations from the Normal Curve. We have
observed from the previous exercises and examples that most biological
variables tend to follow the normal distribution and that methods are
available for making, for any particular sample, an estimate of the form
of the normal distribution from which the sample was drawn. Since
the normal distribution can be expressed by a mathematical equation,
the area of any section of the curve cut off by an ordinate can be
determined readily by integration of the equation, and for all practical
problems this work has been performed and tabulated in Sheppard's
"Tables." It remains to show how these facts form the basis for tests of
significance in statistical problems.

Fig. 4. Sketch of a normal curve showing the proportions of the total area below
and above the ordinate erected at $d/\sigma = +2$.

If a variable is normally distributed and the mean and standard
deviation of the population are known, we can draw the curve and erect
an ordinate at any point. Suppose that such an ordinate is erected at a
point which is at a distance, on the positive side of the mean, exactly
equal to twice the standard deviation. Thus $d/\sigma = 2$, and from
Sheppard's "Tables" we find that $\tfrac{1}{2}(1 + \alpha) = 0.9772$. Taking the total
area of the curve as 1, the area to the left of the ordinate is 0.9772, and
that to the right of the ordinate is (1 - 0.9772) = 0.0228. Assuming
a population of 1000 variates, it follows that 22.8 of these variates
would be greater than the mean by an amount equal to 2 or more times
the standard deviation. Hence if one variate is selected at random from
the 1000, the probability that this variate will exceed
the mean to the extent of 2 or more times the standard deviation is
22.8/1000. Reference to Fig. 4 will make this point clear.

Looking at the same problem from another angle, we inquire as to
the probability, in selecting a variate at random, that this variate shall
fall outside the limits of plus or minus twice the standard deviation.
We erect two ordinates, one at $d/\sigma = -2$, and one at $d/\sigma = +2$; and
our problem is to find the area in both tails of the curve. Obviously
this will be $[1 - \tfrac{1}{2}(1 + \alpha)] \times 2 = (1 - 0.9772) \times 2 = 0.0456$. The
probability that a single variate selected at random will deviate by an
amount equal to or greater than $\pm 2\sigma$ is 45.6/1000, or approximately
1/22.

Probability results are sometimes expressed in terms of odds. If the
probability is 1/22, the odds are 1 out of 22, or, as usually stated, 1 to 21.

For the case above, where the deviations in both directions are
considered, note that the probability is given directly by
$[1 - \tfrac{1}{2}(1 + \alpha)] \times 2 = 1 - \alpha$. The odds are given by $\alpha/(1 - \alpha) : 1$.
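
The one-tail and two-tail calculations, and the conversion to odds, can be sketched as follows. The sketch assumes that the area $\tfrac{1}{2}(1 + \alpha)$ may be taken from the standard library's `NormalDist` rather than from Sheppard's "Tables"; the function names are hypothetical.

```python
from statistics import NormalDist

def upper_tail(ratio):
    """Area to the right of an ordinate erected at d/sigma = ratio."""
    return 1.0 - NormalDist().cdf(ratio)

def two_tail(ratio):
    """Area in both tails beyond -ratio and +ratio, that is, 1 - alpha."""
    return 2.0 * upper_tail(ratio)

def odds_against(p):
    """Odds alpha/(1 - alpha) to 1 against a deviation with probability p."""
    return (1.0 - p) / p

p_one = upper_tail(2.0)
p_two = two_tail(2.0)
print(f"one tail:  P = {p_one:.4f}")
print(f"two tails: P = {p_two:.4f}")
print(f"odds of about {odds_against(p_two):.0f} to 1 against")
```

The last decimal place may differ slightly from the figures in the text, which are built up from four-place tabular values.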

Some examples follow that should make the whole procedure per- 
fectly clear. 

Example 2. The mean ($m$) of a population is 26.4, and the standard deviation ($\sigma$)
is 2.0. Find the probability that a single variate selected at random will be 29.4 or
greater.

The deviation $d = 29.4 - 26.4 = +3.0$. Hence $d/\sigma = 3/2 = 1.5$. For $d/\sigma = 1.5$,
$\tfrac{1}{2}(1 + \alpha) = 0.9332$. The probability $P = (1 - 0.9332) = 0.0668$.

Example 3. For the above population, find the probability that a single variate
selected at random will deviate from the mean to the extent of 3.5 or more.

$$d = \pm 3.5, \qquad \frac{d}{\sigma} = \frac{3.5}{2} = 1.75$$

For $d/\sigma = 1.75$, $\tfrac{1}{2}(1 + \alpha) = 0.9599$, and $\alpha = (0.4599 \times 2) = 0.9198$.

Hence $P = (1 - \alpha) = (1 - 0.9198) = 0.0802$.

Example 4. Determine the value of $d/\sigma$ corresponding to $P = 0.05$.

$$P = (1 - \alpha) = 0.05$$
$$\alpha = (1 - 0.05) = 0.95$$
$$\tfrac{1}{2}(1 + \alpha) = (0.5 + 0.4750) = 0.9750$$

From Sheppard's "Tables," $d/\sigma = 1.96$.
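
Examples 2 to 4 can be checked with a short script. This is a sketch under the assumption that `NormalDist` from the Python standard library may replace the printed tables; the variable names are illustrative only.

```python
from statistics import NormalDist

std = NormalDist()   # unit normal curve

# Example 2: probability of a variate of 29.4 or greater, m = 26.4, sigma = 2.0.
p_example_2 = 1.0 - std.cdf((29.4 - 26.4) / 2.0)

# Example 3: probability of a deviation of 3.5 or more in either direction.
alpha = 2.0 * std.cdf(3.5 / 2.0) - 1.0     # area between the two ordinates
p_example_3 = 1.0 - alpha

# Example 4: d/sigma for which the two-tailed probability is 0.05.
ratio_example_4 = std.inv_cdf(0.5 + (1.0 - 0.05) / 2.0)

print(f"Example 2: P = {p_example_2:.4f}")
print(f"Example 3: P = {p_example_3:.4f}")
print(f"Example 4: d/sigma = {ratio_example_4:.2f}")
```

Example 4 runs the table backwards: instead of reading an area for a given $d/\sigma$, the inverse function returns the $d/\sigma$ that cuts off a given area.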

6. Tests of Departure from Normality.* The $\chi^2$ test of Chapter
IX, Example 19, on the goodness of fit of actual to theoretical normal
frequencies is a general test of the normality of a distribution, and, by
noting those classes that make the greatest contribution to $\chi^2$, we can
come to some decision as to the type of departure from normality. The
test described here is one that involves the calculation of two statistics
that are direct measures of the type and degree of abnormality, Fisher
(1).

* Students studying statistics for the first time are advised to pass over the
remainder of this chapter and come back to it at a later date.

Types of Abnormality. Frequency distributions that depart
significantly from the normal may be divided roughly into three classes:

(a) Skew Distributions. The degree of skewness of a given
distribution is indicated approximately by the measure

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\sigma}$$

where the mode is the position on the base line, or $x$ axis, of a
perpendicular line drawn to the maximum point of the curve. This
measure is obviously zero for the normal distribution, as the curve is
symmetrical and the mean and the mode coincide. When the mode is
greater than the mean we have negative skewness, and when less than
the mean, positive skewness.

(b) Platykurtic, or flat topped. The shoulders of the curve are filled
out and the tails depleted.

(c) Leptokurtic, or peaked. At the center the curve is higher and
more pointed than the normal, and the tails are extended.

In certain distributions we may have skewness as well as kurtosis as
indicated by (b) and (c).

Test for Abnormality. The type of abnormality of a distribution can
be determined directly by calculating two statistics known as $g_1$ and $g_2$.
These are calculated from the $k$ statistics $k_1$, $k_2$, $k_3$, and $k_4$, that are in
turn derived from the sums of the powers up to 4 of the deviations from
the mean.

One of the most convenient methods for the calculation of the $k$
statistics is to obtain first a series of values $a_1 \cdots a_4$, defined as follows:

$$a_1 = \frac{\Sigma(x)}{N}, \qquad a_2 = \frac{\Sigma(x^2)}{N}, \qquad a_3 = \frac{\Sigma(x^3)}{N}, \qquad a_4 = \frac{\Sigma(x^4)}{N}$$

From $a_1 \cdots a_4$, we calculate a series of statistics known as the moments
($v_1 \cdots v_4$), which in this form are uncorrected for grouping in the
frequency table.

$$v_1 = a_1$$
$$v_2 = a_2 - a_1^2$$
$$v_3 = a_3 - 3a_1a_2 + 2a_1^3$$
$$v_4 = a_4 - 4a_1a_3 + 6a_1^2a_2 - 3a_1^4$$

The $k$ statistics are then given by:

$$k_1 = v_1$$
$$k_2 = \frac{N}{N - 1}\, v_2$$
$$k_3 = \frac{N^2}{(N - 1)(N - 2)}\, v_3$$
$$k_4 = \frac{N^2}{(N - 1)(N - 2)(N - 3)} \left[ (N + 1)v_4 - 3(N - 1)v_2^2 \right]$$

Two of the $k$ statistics, $k_2$ and $k_4$, require correction for the interval of
grouping of the frequency distribution. For a unit interval the
corrected values are given by:

$$k_2' = k_2 - \tfrac{1}{12}, \qquad \text{and} \qquad k_4' = k_4 + \tfrac{1}{120}$$

Corrections for other intervals will, of course, not be necessary, as it is
always possible to use a unit interval for the purpose of calculating the $k$
statistics.

The measures of curve type $g_1$ and $g_2$ are given as follows, with their
standard errors:

$$g_1 = \frac{k_3}{(k_2)^{3/2}}, \qquad SE_{g_1} = \sqrt{\frac{6N(N - 1)}{(N - 2)(N + 1)(N + 3)}}$$

$$g_2 = \frac{k_4}{(k_2)^{2}}, \qquad SE_{g_2} = \sqrt{\frac{24N(N - 1)^2}{(N - 3)(N - 2)(N + 3)(N + 5)}}$$

For normal distributions both $g_1$ and $g_2$ are zero. The former is a
measure of symmetry and has the same sign as (mean - mode). Figure 5
illustrates positive and negative skewness as indicated by positive and
negative values of $g_1$. A positive value of $g_2$ indicates a peaked curve,
and a negative value a flat-topped curve. These two types are also
illustrated in Fig. 5 (see page 31).





Example 5. We shall take as an example to which to apply the test for normality
the frequency distribution given in Table 7, which also contains the necessary
calculations. We get:

$$g_1 = +0.184, \qquad SE_{g_1} = 0.227$$

$$g_2 = +0.0188, \qquad SE_{g_2} = 0.451$$

The signs of $g_1$ and $g_2$ indicate that the curve departs slightly from normality in
having a slight positive skewness and in being slightly peaked, but the values of
$g_1$ and $g_2$ are very much less than twice their standard errors, so we conclude that
there is no evidence of a significant departure from normality.

When the number of classes is fairly large it is desirable to calculate the $k$ statistics
using an assumed mean. We measure $x$ in terms of the deviations from the assumed
mean and proceed exactly as in Table 7. Table 8 is an example of the calculation
of the $k$ statistics by this method, using the same data as in Table 7.


TABLE 7

Calculation of the k Statistics

    x    Frequency (f)       fx        fx²         fx³          fx⁴
    1          1              1          1           1            1
    2          6             12         24          48           96
    3         13             39        117         351        1,053
    4         25            100        400       1,600        6,400
    5         30            150        750       3,750       18,750
    6         22            132        792       4,752       28,512
    7          9             63        441       3,087       21,609
    8          5             40        320       2,560       20,480
    9          2             18        162       1,458       13,122

 Σ(x)     (N = 113)         555      3,007      17,607      110,023

 a1 ... a4            4.911,504    26.6106     155.814      973.655
                                  -24.1229    -392.094   -3,061.124
                                               236.959    3,851.545
                                                         -1,745.739

 v1 ... v4               4.9115     2.4877       0.679       18.337
 k1 ... k4               4.9115     2.5098       0.697        0.103
 Corrections                       -0.0833                    0.008
 Corrected k             4.9115     2.4265       0.697        0.111

$$g_1 = \frac{0.697}{(2.4265)^{3/2}} = +0.184, \qquad SE_{g_1} = 0.227$$

$$g_2 = \frac{0.111}{(2.4265)^{2}} = +0.0188, \qquad SE_{g_2} = 0.451$$
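
The chain of calculations from the sums of powers to $g_1$ and $g_2$ can be sketched in Python as below. The function name `g_statistics` is hypothetical, the frequencies are those of Table 7, and a unit class interval is assumed so that the corrections of 1/12 and 1/120 apply. Since the machine carries more decimal places than the hand calculation, the last figure of $g_2$ may differ slightly from the table.

```python
from math import sqrt

def g_statistics(freq):
    """Return (g1, SE g1, g2, SE g2) for a dict {x: frequency}, unit class interval."""
    n = sum(freq.values())
    # a1 ... a4: sums of powers of x divided by N
    a1, a2, a3, a4 = (sum(f * x**p for x, f in freq.items()) / n for p in (1, 2, 3, 4))
    # moments about the mean, uncorrected for grouping
    v2 = a2 - a1**2
    v3 = a3 - 3*a1*a2 + 2*a1**3
    v4 = a4 - 4*a1*a3 + 6*a1**2*a2 - 3*a1**4
    # k statistics; k2 and k4 carry the corrections for a unit interval
    k2 = n / (n - 1) * v2 - 1/12
    k3 = n**2 / ((n - 1) * (n - 2)) * v3
    k4 = n**2 / ((n - 1) * (n - 2) * (n - 3)) * ((n + 1)*v4 - 3*(n - 1)*v2**2) + 1/120
    g1 = k3 / k2**1.5
    g2 = k4 / k2**2
    se_g1 = sqrt(6*n*(n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_g2 = sqrt(24*n*(n - 1)**2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return g1, se_g1, g2, se_g2

table7 = {1: 1, 2: 6, 3: 13, 4: 25, 5: 30, 6: 22, 7: 9, 8: 5, 9: 2}
g1, se_g1, g2, se_g2 = g_statistics(table7)
print(f"g1 = {g1:+.3f}  SE = {se_g1:.3f}")
print(f"g2 = {g2:+.4f}  SE = {se_g2:.3f}")
print("departure from normality indicated:", abs(g1) > 2*se_g1 or abs(g2) > 2*se_g2)
```

The final line applies the rule of comparing each statistic with twice its standard error.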
Fig. 5. Illustrating types of abnormality in frequency distributions (panels
labeled Leptokurtic and Platykurtic). MO = mode, and ME = mean.


TABLE 8

Calculation of the k Statistics Using an Assumed Mean

                 Deviations (d)
                 from Assumed
    x      f         Mean          fd        fd²         fd³        fd⁴
    1      1          -4           -4         16         -64        256
    2      6          -3          -18         54        -162        486
    3     13          -2          -26         52        -104        208
    4     25          -1          -25         25         -25         25
    5     30           0            0          0           0          0
    6     22           1           22         22          22         22
    7      9           2           18         36          72        144
    8      5           3           15         45         135        405
    9      2           4            8         32         128        512

 Σ(d)  (N = 113)                  -10        282           2      2,058

 a1 ... a4                 -0.088,496     2.4956   0.017,699     18.212
                                         -0.0078   0.662,552      0.006
                                                  -0.001,386      0.117
                                                                 -0.000

 v2 ... v4                                2.4878      etc.
 k2 ... k4                                2.5098      etc.
7. Exercises.

1. Calculate the ordinates (y) and the theoretical normal frequencies for the
frequency distribution of either Chapter II, Exercise 2, or Chapter II, Exercise 3.
Totalling the theoretical frequencies will provide a check on the calculations.

2. Make two graphs for Exercise 1.

(a) Histogram of actual frequencies and smooth normal curve.

(b) Histogram of theoretical frequencies and smooth normal curve.

3. Examine equation (1) in Section 3 above, and show how the value of $\sigma$ affects
the shape of the curve.

4. If the mean of a population is 21.65 and $\sigma$ is 3.21, determine the probability
that a variate taken at random will be greater than 28.55 or less than 14.75.

P = 0.03.

5. If, for the population described in Exercise 4, the standard deviation of the
mean of a sample of 400 variates is $\sigma/\sqrt{400}$, find the probability that the mean of
any one sample of 400 taken at random will fall outside the limits 21.33 to 21.97.

P = 0.046.

6. Determine $d/\sigma$ values corresponding to the P values of 0.001, 0.01, 0.02, 0.10,
0.20, and 0.50.

7. Test the following distributions for departure from normality.

(a)  x    1   2    3    4    5    6   7   8   9  10  11  12  13  14  15  16
     f    1  57  185  217  177  126  87  64  30  20  14  11  13   6   1   2

(b)  x    1   2    3    4    5    6   7   8   9  10  11  12  13  14  15  16
     f    2   3    4    5    7   11  17  30  50  34  21  10   7   5   2   2

(c)  x    1   2    3    4    5    6   7   8   9  10  11  12  13  14  15  16
     f    1   7   13   19   23   26  27  28  26  24  22  17  14   9   4   1

(a) $g_1 = 1.360$, $g_2 = 2.143$. (b) $g_1 = -0.327$, $g_2 = 0.939$. (c) $g_1 = 0.107$, $g_2 = -0.766$.
TESTS OF SIGNIFICANCE WITH SMALL SAMPLES

1. The Estimation of the Standard Deviation. In Chapter II,
Section 2, it was pointed out that the best estimate of the standard deviation
of a population from which a sample has been drawn is $\sqrt{\Sigma(x - m)^2/N}$,
where $m$ is the mean of the population and $N$ is the number in the sample.
Since we never know the value of $m$, we use $\bar{x}$ instead; but the
substitution of $\bar{x}$ in the above formula will not give us the best possible estimate
of $\sigma$; actually it will give us an estimate that is too small. In other
words, if we take a large number of samples and calculate a standard
deviation for each one, the average value of our standard deviations will
be low, and this will be true regardless of how many samples we take.
As a matter of fact, if we take a large enough number of samples, we can
predict with accuracy the extent of the negative bias in the average of
the standard deviations. To the beginner these facts often appear
somewhat mysterious, particularly the fact that the bias in our estimate
can be removed, as pointed out in Chapter II, by using the formula
$\sqrt{\Sigma(x - \bar{x})^2/(N - 1)}$. It may seem peculiar that the bias can be
removed in so simple a manner. Now, it is easy enough to work out
this proposition algebraically, but this does not settle the question
necessarily for the beginner, as it is quite possible to work through a
derivation and follow all the steps without really understanding the
situation. Consequently, we shall not use the algebraic method here,
but will try instead to point out why a bias should exist and why it is
reasonable that it should be removed by dividing the sum of squares of
the deviations from the sample mean by 1 less than the number in the
sample.

In the first place, we have noted already that the sum of the
deviations from the mean of a sample or of a population is zero (Chapter II,
Section 1). We shall now note that the sum of the squares of the
deviations from the mean is a minimum. If the mean of the population is $m$
and we take a large number of samples of size $N$ and in each case we
determine $\Sigma(x - m)^2$, it follows that the sum of all these will be the
same as if we had merely gone through the whole population without
considering any portion of the variates as a sample. Then, on dividing
this total sum of squares by the total number and extracting the square
root, we would have the value of $\sigma$ for the whole population. It
obviously does not matter whether we divide the population into samples
and determine $\sigma$ for each one and then average, or merely take the whole
population as one sample. However, this procedure is possible only in
theory, as $m$ is actually unknown. For each sample, therefore, suppose
that we calculate $\sqrt{\Sigma(x - \bar{x})^2/N}$ and then average. Now, since the
sum of the squares of the deviations from the mean is a minimum, the
use of $\bar{x}$ will give a minimum value for the sample; but, since the values
of $\bar{x}$ vary from sample to sample, it is perfectly clear that $\Sigma(x - \bar{x})^2$ for
any one sample will be as large as $\Sigma(x - m)^2$ for the same sample only
if $\bar{x}$ happens to be equal to $m$. No matter how slightly $\bar{x}$ varies from $m$,
the sum of the squares of the deviations from the mean of the sample will
be smaller than the sum of the squares of the deviations from the
population mean, and hence the value of the standard deviation is
underestimated by the formula which has $N$ as a divisor.

Now let us consider the extent of the bias and how it may be removed.
There are $N$ values in a sample, and in theory each of the $N$ variates
contributes equally to the estimate of the standard deviation; but in calculating
$\Sigma(x - \bar{x})^2$ we use one value, $\bar{x}$, which is determined by the sample, and
hence the effective weight of the sample is equal to $N - 1$ instead of $N$.
All the values of one sample may be large, and if we could calculate
$\Sigma(x - m)^2$ these values would contribute more to the total sum of
squares than a set of values in another sample which are closer to $m$.
Actually, since we take the deviations from the mean of the sample, the
first sample would not necessarily contribute any more than the second
sample. This brings out the idea that the mean used is fixed by the
sample, and that this reduces the effective weight of the sample
by 1. Thus we have the term introduced by R. A. Fisher, "degrees of
freedom." When a sample of $N$ variates is used for purposes of
estimation, its weight is only that of the number of degrees of freedom. For
every statistic calculated from the sample and utilized in forming the
estimate, there is a loss of one degree of freedom. Thus, in the present
example of estimating the standard deviation, the statistic calculated
from the sample is $\bar{x}$, and there is a corresponding loss of one degree of
freedom. This principle will be found to hold throughout all statistical
procedure.
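
The argument can be checked on a small scale by going through every possible sample from a tiny population. The sketch below is an illustration only: the population of six values is hypothetical, sampling is with replacement, and the comparison is made on the variance (the square of the standard deviation), for which the divisor $N - 1$ removes the bias exactly.

```python
from itertools import product

population = [1, 2, 3, 4, 5, 6]      # hypothetical small population
m = sum(population) / len(population)
sigma2 = sum((x - m) ** 2 for x in population) / len(population)

def average_estimates(n):
    """Average, over every possible sample of size n drawn with replacement,
    of the sum of squares about the sample mean divided by n and by n - 1."""
    by_n = by_df = 0.0
    count = 0
    for sample in product(population, repeat=n):
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)
        by_n += ss / n             # divisor N: too small on the average
        by_df += ss / (n - 1)      # divisor N - 1: degrees of freedom
        count += 1
    return by_n / count, by_df / count

biased, unbiased = average_estimates(4)
print("population variance:       ", round(sigma2, 4))
print("average using divisor N:   ", round(biased, 4))
print("average using divisor N-1: ", round(unbiased, 4))
```

With samples of 4 the divisor $N$ recovers only three-fourths of the population variance on the average, which is the ratio $(N - 1)/N$.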

2. Terminology and Symbols for Populations and Samples:
Introducing the Term Variance. As pointed out above, we speak of
population parameters, which are true and undeviating values, and
statistics, which are estimates, from the samples, of the population
parameters. The statistics we have discussed so far are the mean $\bar{x}$ and the
standard deviation $s$; and the corresponding parameters are $m$ and $\sigma$.
Very frequently in statistical procedure the square of the standard
deviation, usually referred to as the variance, is the more convenient of the
two statistics. Most tests of significance can be made by means of the
variance, in which case the extraction of the square root in order to
obtain the standard deviation is an unnecessary operation. In general,
all discussions of methods of estimation refer equally to the standard
deviation and the variance, and consequently in Example 6 below we
confine our attention to the variance.

Before proceeding with Example 6 it may be of assistance to
summarize the symbols and terms that have been used up to this point, and
any others that have not been used but are related to those already
discussed. This summary is as follows:

Parameters                                           Statistics

Mean                          $m$                    Mean                             $\bar{x}$
Standard deviation            $\sigma$               Standard deviation               $s$
Standard deviation of a mean  $\sigma_m$             Standard deviation or standard
                                                     error of a mean                  $s_{\bar{x}}$
Variance                      $\sigma^2$ or $V$      Variance or mean square          $s^2$ or $v$
Variance of a mean            $\sigma_m^2$ or $V_m$  Variance of a mean               $s_{\bar{x}}^2$ or $v_{\bar{x}}$
                                                     Number in sample                 $N$ or $n'$
                                                     Degrees of freedom               $n$

Special notice should be taken of the term standard error, which is coming into
general use in place of the standard deviation of a sample mean.

Example 6. The Use of Degrees of Freedom in Estimating the Variance.

In Table 9 we have a set of random numbers taken from Tippett's tables (6), arranged
in 10 groups of 20 numbers each. The variation in these numbers may be assumed
to be made up of two portions: (1) within the groups, and (2) between the groups.
But if the numbers have been selected at random these sources of variation will
be equally balanced. They would be unbalanced if, for example, some groups had
all small numbers and the other groups all large numbers. The random selection
of the numbers ensures that this shall not be the case. In terms of variance, the
above statement with respect to variation is simply that the variances for within
groups, between groups, and the total variance will all be equal within the limits of
random sampling. Now, if for a particular set of numbers, as in this set, the variance
for between groups is adjusted until it is almost exactly equal to the total variance,
it follows that the variance within groups must also be almost exactly equal to the
total. We can determine, therefore, the variance within each group, and if our
method is correct these should give an average value very close to that for the whole
sample.

The calculation of the variances within groups has been performed in Table 10
by two methods. There are 20 numbers in each group, so that in each group we
have 19 degrees of freedom for the estimation of the variance. In column (7) of
Table 10 the sums of squares are divided by the degrees of freedom, but in column (8)
they are divided by 20, the number in the sample. At the foot of the table the total
variance is again calculated by two methods. In the first case we divide by 199 and
TABLE 9

             A     B     C     D     E     F     G     H     I     J
     1      29    ..    14    ..    ..    47    28    35    18    25
     2      45    ..    49    ..    ..    36    24    27    16    39
     3      16    ..    29    18    46    42    10    18    34    24
     4      37    11    31    28    44    36    27    44    18    30
     5      50    12    19    20    28    38    11    25    30    24
     6      10    37    20    44    40    21    42    33    29    ..
     7      26    44    49    41    27    41    22    49    35    31
     8      19    15    15    10    28    26    30    11    35    10
     9      24    50    11    43    27    17    17    17    42    13
    10      10    14    22    19    11    ..    33    39    50    43
    11      10    22    23    10    48    ..    44    26    21    27
    12      48    ..    41    13    21    ..    32    29    11    20
    13      22    46    40    31    44    21    23    16    45    39
    14      13    14    12    45    16    46    25    47    18    ..
    15      28    21    39    39    36    22    27    10    31    18
    16      32    15    43    23    42    34    16    20    26    11
    17      10    37    31    11    12    50    20    12    34    46
    18      11    26    34    22    48    13    47    42    22    43
    19      30    29    49    35    30    46    38    50    24    44
    20      37    22    37    49    30    47    12    34    42    24

Totals     507   548   608   558   621   702   528   584   581   577

in the second case by 200. We have, therefore, four determinations of the variance
as shown below. Note that the last line is calculated independently and does not
come from totalling the values above except for columns (2) and (3).

By the first method we obtain for the average variance within groups a value
that is 99.94% of the total. By the second method the average variance is only
95.43% of the total, and therefore underestimates the true value by 4.57%. Where $N$
is the number of variates in a sample, it follows therefore that the correct estimate
of the variance is given by $\Sigma(x - \bar{x})^2/(N - 1)$.
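
The two methods of Table 10 can be reproduced from the group totals and sums of squares alone. The sketch below takes columns (2) and (3) of Table 10 as its data; the sum of squares for group C is taken as 21,602, an assumption based on the value implied by the grand total of 198,414. The names are illustrative.

```python
# group: (total T, sum of squares of the 20 numbers), columns (2) and (3) of Table 10
groups = {
    "A": (507, 16159), "B": (548, 18110), "C": (608, 21602), "D": (558, 18620),
    "E": (621, 22189), "F": (702, 27208), "G": (528, 16132), "H": (584, 20306),
    "I": (581, 19043), "J": (577, 19045),
}
SIZE = 20   # numbers in each group

def sum_of_squares(total, sum_sq, n):
    """Sum of squared deviations from the sample mean: S(x^2) - T^2/N."""
    return sum_sq - total ** 2 / n

within = [sum_of_squares(t, s, SIZE) for t, s in groups.values()]
avg_df = sum(ss / (SIZE - 1) for ss in within) / len(within)   # method (1)
avg_n = sum(ss / SIZE for ss in within) / len(within)          # method (2)

# The whole set of 200 numbers treated as one sample.
grand_total = sum(t for t, _ in groups.values())
grand_sq = sum(s for _, s in groups.values())
n_all = SIZE * len(groups)
total_ss = sum_of_squares(grand_total, grand_sq, n_all)
total_df = total_ss / (n_all - 1)
total_n = total_ss / n_all

print(f"method (1): {avg_df:.4f} within, {total_df:.4f} total, {100 * avg_df / total_df:.2f}%")
print(f"method (2): {avg_n:.4f} within, {total_n:.4f} total, {100 * avg_n / total_n:.2f}%")
```

Only the first method brings the average variance within groups into agreement with the total variance.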

3. The Distribution of the Estimates of the Standard Deviation. If
a large population is being sampled and each sample contains 100
variates, we will get a series of varying values for the standard deviation
calculated from these samples. But, if, instead of taking samples of
100 variates, we take samples of 10, it is to be expected that in the second
case we will get values for the standard deviation fluctuating more widely
than in the first case. This is the same as saying that the distribution
of the standard deviation is dependent on the number of degrees of
freedom in the sample. In this respect it is very much the same as a
mean. In order to obtain from one sample a value for the mean that
TABLE 10

Calculation of Variance Values of Figures in Table 9
by Groups of 20 and for Whole Group

  (1)     (2)       (3)      (4)        (5)          (6)            (7)             (8)
 Group     T       Σ(x²)      x̄        T²/N       Σ(x - x̄)²    Σ(x - x̄)²/19    Σ(x - x̄)²/20

   A      507     16,159    25.35    12,852.45     3,306.55      174.0289        165.3275
   B      548     18,110    27.40    15,015.20     3,094.80      162.8842        154.7400
   C      608     21,602    30.40    18,483.20     3,118.80      164.1474        155.9400
   D      558     18,620    27.90    15,568.20     3,051.80      160.6210        152.5900
   E      621     22,189    31.05    19,282.05     2,906.95      152.9974        145.3475
   F      702     27,208    35.10    24,640.20     2,567.80      135.1474        128.3900
   G      528     16,132    26.40    13,939.20     2,192.80      115.4105        109.6400
   H      584     20,306    29.20    17,052.80     3,253.20      171.2210        162.6600
   I      581     19,043    29.05    16,878.05     2,164.95      113.9447        108.2475
   J      577     19,045    28.85    16,646.45     2,398.55      126.2395        119.9275

                                                        Av.      147.6642        140.2810

           T       Σ(x²)                T²/N       Σ(x - x̄)²    Σ(x - x̄)²/199   Σ(x - x̄)²/200
 Total   5,814    198,414            169,012.98    29,401.02     147.7438        147.0051


                              Method (1)         Method (2)
                              Using Degrees      Using Number
                              of Freedom         in Sample

 Average within Groups          147.66             140.28
 Total                          147.74             147.01

is quite close to the mean of the parent population, we must take a large
sample. Small samples will give us unbiassed estimates, but they will
be more variable estimates.

Now in Chapter II we observed that, if a population is normally
distributed and we know its standard deviation and mean, we can make a
direct calculation of the probability of drawing from that population a
sample with a mean of a given magnitude. This is, in a sense, a test of
the significance of the mean of a particular sample, since if the
probability is very small we should conclude that the sample was not drawn
from the population in question, but from some other population.
However, the standard deviation of the population cannot be
determined, and the only value we have is the estimate $s$ which has been
calculated from the sample and varies from sample to sample. This
brings us therefore to the general question of making tests of significance
from the data of samples of any size.
4. Tests of Significance. The method of Chapter II for making
probability determinations arose from our knowledge that the ratio of a
mean of a sample to the standard deviation of the population from which
the sample is drawn is normally distributed. This follows, of course,
because, if the mean is normally distributed and the standard deviation
is constant for the population, the ratio of the two will also be normally
distributed. Suppose, however, that we take the ratio of the mean of a
sample to the estimate of the standard deviation $s$. Since $s$ is more
variable for small samples than for large ones, the ratio will obviously have
a distribution that is dependent on the size of the sample, and, in order to
determine the probability of the occurrence of any particular value of
this ratio, we must know its distribution. This was worked out by
"Student" (4) in 1908, and for the first time practical statisticians had
placed in their hands a tool which could be applied in tests of significance
for samples of all sizes. "Student" gave first a set of tables for the
distribution of $\bar{x}/s$, which he designated by the letter $z$. Later he
prepared a table based on the distribution of $t$, which is $\bar{x}/s_{\bar{x}}$. Fisher,*
in "Statistical Methods for Research Workers," gives a compact table of
$t$ for degrees of freedom varying from 1 to 30, and the probability levels
P = 0.01, 0.02, 0.05, 0.10, and 0.90. These are the most convenient for
general use, and are reproduced in part in Table 94.

Example 7. Two varieties of wheat are compared in 4 pairs of plots, there being
1 plot of each variety in each pair. Referring to the two varieties as A and B, we
determine the difference in yield A-B for the 4 pairs of plots, and the results are as
follows in bushels per acre:

Pair    1    2    3    4

A-B     2    4    4    6

The differences are all positive and are therefore in favor of the variety A; but we
wish to make a test so as to be able to state whether or not the data are in agreement
with any hypothesis that we may set up. The obvious hypothesis here is that the
varieties are not different in yielding quality, and consequently our theoretical
distribution is built up on that basis. If the varieties are not different, the data will
be expected to give a value of $t$ that is not improbable. If they are different, we will
expect the data to give a value of $t$ which will occur by random sampling in only a
small proportion of the cases. Let us proceed to the calculation of $t$.

We note first that the mean difference is 4, and that the sum of the squares of the
deviations of the individual values from the mean is 8. We then have $s^2 = 8/3$,
the denominator being the number of degrees of freedom available for estimating the
standard deviation $s$. Then $s = \sqrt{8/3}$, and $s_{\bar{x}} = \sqrt{8/(3 \times 4)}$, which simplifies to
$\sqrt{2/3}$. Finally $t = 4 \times \sqrt{3/2} = 4.90$. Now if we examine Table 94 it is observed
that the 5% value of $t$ for 3 degrees of freedom is 3.18, and the 1% value of $t$ is 5.84.
Thus the value of $t$ given by the data would occur according to the hypothesis in
less than 5% and somewhat more than 1% of the cases. Our conclusion is that the
difference observed is due to a real varietal effect, and is not a chance occurrence.
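
The arithmetic of Example 7 can be followed step by step in a short sketch. The function name `paired_t` is hypothetical; the 5% and 1% points for 3 degrees of freedom are the tabled values quoted in the example and are entered as constants rather than computed.

```python
from math import sqrt

def paired_t(differences, hypothesis=0.0):
    """Return t = (mean - m) / (standard error of the mean) and its degrees of freedom."""
    n = len(differences)
    mean = sum(differences) / n
    ss = sum((d - mean) ** 2 for d in differences)   # sum of squares of deviations
    variance = ss / (n - 1)                          # s^2, divisor is degrees of freedom
    std_error = sqrt(variance / n)                   # standard error of the mean
    return (mean - hypothesis) / std_error, n - 1

t, df = paired_t([2, 4, 4, 6])
T_5_PERCENT, T_1_PERCENT = 3.18, 5.84    # tabled values of t for 3 degrees of freedom

print(f"t = {t:.2f} on {df} degrees of freedom")
print("exceeds the 5% point:", t > T_5_PERCENT)
print("exceeds the 1% point:", t > T_1_PERCENT)
```

The optional `hypothesis` argument allows the same calculation to be repeated for a population mean other than zero.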

It may be argued that in an example such as the above we are not actually
testing the significance of the mean difference, because we are basing it on the
distribution of $t$, wherein an exceptional value of $t$ may be due to extreme deviations in
either the mean difference or the standard error. This point is actually only of
academic interest, because in either case the two samples are proved to be different
regardless of which factor brings about the exceptional value of $t$. When we consider
the actual problem of testing the difference in yield of two varieties, it is obvious
that a real difference in the variation of the yields from plot to plot is so unlikely a
factor that in general we can disregard this viewpoint, and assume that the significant
value of $t$ is at least mainly due to a significant difference in the mean yields.

5. Fiducial Limits. Stress has already been laid on the principle of
estimation; and we come now to a method of setting up limiting values
according to given probability levels, such that it can be said with a
reasonable degree of certainty that the true value which is being
estimated lies between these limits. In the example above, the difference
between the yields of the two varieties was found to be significant; but
no attempt was made to set up two limiting values, one on each side of
the mean difference of 4 bushels, and to state that according to a given
probability level, the true mean difference was between these limits.
If we can perform such an operation it will obviously be of great
practical value, because in the end we are not really concerned with being
able to say only that one variety is a higher yielder than the other.
Unless we can make a reliable estimate of this difference our experiment
is not contributing information of value in actual practice.

It was emphasized in Chapter I that a test of significance involves 
setting up a hypothesis and determining the agreement between the 
hypothesis and the data of the experiment, and furthermore that any 
hypothesis whatever can be set up. In the example above, the hypoth- 
esis was that the mean difference in yield between the varieties was zero, 
and what we actually did was to find the value of t from the expression 
$(\bar{x} - m)/s_{\bar{x}}$, where m, the mean of the parent population according to
the hypothesis, was taken to be zero. We can, however, take m equal
to any value that we please, and we might choose for example to take m
equal to 2. Then $t = (4 - 2)/s_{\bar{x}} = 2.46$, and this value is less
than the 5% point. The inference from this test is that there is no
definite evidence that the true difference is greater or less than 2. We 
begin to see therefore that, though our difference is significant, we cannot 
specify very closely the range within which the true value lies. Suppose 
now that we can locate a lower limit such that, if we substituted it for m 
in the t test, the value of t obtained would be exactly equal to its 5%
point, and we determine in addition a similar upper limit. The observed
difference could then be said to differ significantly from either of the
limiting values, and we could say with a reasonable degree of certainty
that the true value lies between these limits. The procedure is simple,
as all we have to do is to set up the equation for t with m as an unknown
and t equal to its value at the 5% point. Thus:

$$3.18 = (4 - m)/s_{\bar{x}}$$

Solving for m we get an answer of 1.40, and since the limits lie at equal
distances on either side of the observed difference of 4, our limits are
1.40 to 6.60. It is now clear that, although our experiment gave a
significant result, it did not enable us to estimate very accurately the
true difference in yield between the two varieties. These limiting values
have been very aptly termed by R. A. Fisher the fiducial limits, and in the
present example we would describe them as the fiducial limits at the 5% point.
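
The calculation of the limits can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fiducial_limits` is hypothetical, and because the standard error is not restated in this passage it is inferred here from the statement that taking m = 2 gives t = 2.46.

```python
def fiducial_limits(mean_diff, std_error, t_crit):
    """Return (lower, upper): the two values of m for which
    t = (mean_diff - m) / std_error equals +t_crit and -t_crit."""
    half_width = t_crit * std_error
    return mean_diff - half_width, mean_diff + half_width


# Assumption: standard error recovered from (4 - 2) / s = 2.46 as quoted in the text.
std_error = (4 - 2) / 2.46

# 3.18 is the 5% point of t used in the text for this example.
lower, upper = fiducial_limits(mean_diff=4.0, std_error=std_error, t_crit=3.18)
print(round(lower, 2), round(upper, 2))
```

Under these assumptions the limits come out at roughly 1.4 and 6.6, equally spaced about the observed difference of 4.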

6. General Methods for Testing the Significance of Differences.
One of the most common problems in statistics is the testing of the
significance of a difference between two means. The reasoning behind
such tests involves picturing an infinite population of differences for
which the mean is zero. We have two samples for which the means are
different; and we wish to know in what proportion of the cases, on the
average, in the procedure of taking pairs of samples, we will get a
difference as large as or larger than the one observed. Tests of this kind fall
into two classes:

(a) Samples are distinct and the variates are not paired in any way. 
If there are two blocks of land and we take the yields of a group of plots 
from each block, and we wish to test the significance of the difference
between the means for the blocks, we have a problem that falls into
this class. The number of variates in the two samples may be either
the same or different. Let the samples be designated as 1 and 2; then:

$\bar{x}_1$ = mean of sample 1.

$\bar{x}_2$ = mean of sample 2.

$\bar{x}_1 - \bar{x}_2$ = mean difference to be tested.

$n_1$ = degrees of freedom for sample 1, which contains,
therefore, $n_1 + 1$ variates.

$n_2$ = degrees of freedom for sample 2, which contains,
therefore, $n_2 + 1$ variates.

The calculations are carried out as follows:

$\Sigma(x_1 - \bar{x}_1)^2$ = sum of squares for sample 1.

$\Sigma(x_2 - \bar{x}_2)^2$ = sum of squares for sample 2.

$$s = \sqrt{\frac{\Sigma(x_1 - \bar{x}_1)^2 + \Sigma(x_2 - \bar{x}_2)^2}{n_1 + n_2}} \qquad (1)$$

$$s_{\bar{x}} = s\sqrt{\frac{n_1 + n_2 + 2}{(n_1 + 1)(n_2 + 1)}} \qquad (2)$$

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}}} \qquad (3)$$


We enter the table of t under $n = n_1 + n_2$. If the samples contain an
equal number of variates, we have:

$$(n_1 + 1) = (n_2 + 1) = N$$

$$s = \sqrt{\frac{\Sigma(x_1 - \bar{x}_1)^2 + \Sigma(x_2 - \bar{x}_2)^2}{2(N - 1)}} \qquad (4)$$

and

$$s_{\bar{x}} = s\sqrt{\frac{2}{N}} \qquad (5)$$

The table of t is entered under $n = 2(N - 1)$.


Example 8. Let $\bar{x}_1$ = 196.42 and $\bar{x}_2$ = 198.82, so that the means differ by 2.40. The
samples are taken independently, and consequently there is no reason for assuming
that $x_1$ and $x_2$ are correlated. In sample 1 we have taken 9 variates, and in sample 2
we have 7 variates. Hence $n_1$ = 8 and $n_2$ = 6. We calculate first $\Sigma(x_1 - \bar{x}_1)^2$
and $\Sigma(x_2 - \bar{x}_2)^2$. We will assume that this is done, and we get:

$\Sigma(x_1 - \bar{x}_1)^2$ = 26.94
$\Sigma(x_2 - \bar{x}_2)^2$ = 18.73
Total = 45.67

Then:

$$s = \sqrt{\frac{45.67}{14}} = 1.81$$

and

$$t = \frac{2.40}{1.81}\sqrt{\frac{63}{16}} = 2.62$$

Entering the table of t under n = 14 we find that a t value of 2.62 corresponds almost
exactly with a P value of 0.02. Between the means of the two samples a difference
of 2.40 would occur by chance in only 2 cases out of 100.
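
As a check on the arithmetic of Example 8, formulas (1) to (3) can be sketched in Python. The function name `pooled_t` is hypothetical and the code is only an illustration of the steps, not a substitute for them. Carried to full precision the result is about 2.64, a little above the 2.62 quoted; the small discrepancy presumably arises from rounding in the hand calculation.

```python
import math


def pooled_t(mean1, mean2, ss1, ss2, n1, n2):
    """t for two independent samples from their means and sums of squares.

    n1 and n2 are degrees of freedom, so the samples hold n1 + 1 and n2 + 1 variates.
    """
    s = math.sqrt((ss1 + ss2) / (n1 + n2))                       # formula (1)
    se = s * math.sqrt((n1 + n2 + 2) / ((n1 + 1) * (n2 + 1)))    # formula (2)
    t = (mean1 - mean2) / se                                     # formula (3)
    return s, se, t


# Figures of Example 8: sample 1 has 9 variates, sample 2 has 7.
s, se, t = pooled_t(196.42, 198.82, 26.94, 18.73, 8, 6)
print(round(s, 2), round(abs(t), 2))
```

The sign of t only records which mean is the larger; the table of t is entered with its absolute value.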


(b) Variates are paired; that is, each value of $x_1$ is associated in some
logical way with a corresponding value of $x_2$. Thus, if two varieties of a
field crop are being tested in pairs of plots, each pair containing one plot
of each variety, we would have a problem of this kind. There will,
of course, be the same number of variates in the two samples so that,
if there are N pairs, there will be N - 1 degrees of freedom available
for the comparison. This follows logically from the fact that we are
now dealing with individual differences and there is one difference for
each pair of variates.

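The calculations are:

$$s = \sqrt{\frac{\Sigma(x_1 - x_2)^2 - [\Sigma(x_1 - x_2)]^2/N}{N - 1}} \qquad (6)$$

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} \qquad (7)$$

t = same as formula (3)

If the student should be confused to find later that s as computed
above is not the same as when obtained by the analysis of variance, it
may be just as well to adopt the following method, which is identical
with that of the analysis of variance. The value of t obtained by the
two methods is, of course, the same.

$$s = \sqrt{\frac{\Sigma(x_1 - x_2)^2 - [\Sigma(x_1 - x_2)]^2/N}{2(N - 1)}} \qquad (8)$$

$s_{\bar{x}} = s\sqrt{2/N}$, as in formula (5)

t = same as formula (3)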


Example 9. In this example assume that the variates are paired, as in a feeding
experiment where a series of animals are paired up according to initial weight.
One animal in each pair is given ration 1 and the other one ration 2. There are
10 pairs of animals, and the difference between the mean gains per 100 pounds of
feed at the end of the feeding period is 1.42 pounds. We shall assume that the
sum of squares of the differences has been calculated and that formula (8) gives

s = 1.30

Then

$$t = \frac{1.42}{1.30}\sqrt{\frac{10}{2}} = 2.44$$

Entering the t table under n = 9, we find that the P value is between 0.05 and 0.02,
but closer to the former. We can take P = 0.05 as approximately correct, so that
the difference between the two means could only occur by chance in about 1 out of
20 trials.
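
The statement that the two methods of computing s lead to the same t can be verified with a short Python sketch. The function name and the ration figures below are hypothetical, invented purely for illustration; only the last line uses the summary figures of Example 9.

```python
import math


def paired_t_two_ways(x1, x2):
    """Paired t computed from s on a difference basis and on a single-plot basis."""
    n_pairs = len(x1)
    d = [a - b for a, b in zip(x1, x2)]
    mean_d = sum(d) / n_pairs
    # Sum of squares of the differences about their own mean.
    ss_d = sum(v * v for v in d) - sum(d) ** 2 / n_pairs

    # s as the standard deviation of the differences; standard error s / sqrt(N).
    s_diff = math.sqrt(ss_d / (n_pairs - 1))
    t_a = mean_d / (s_diff / math.sqrt(n_pairs))

    # s on a single-plot basis, as in the analysis of variance; standard error s * sqrt(2 / N).
    s_plot = math.sqrt(ss_d / (2 * (n_pairs - 1)))
    t_b = mean_d / (s_plot * math.sqrt(2 / n_pairs))
    return t_a, t_b


# Hypothetical gains for five pairs of animals, invented for illustration only.
ration_1 = [10.2, 11.5, 9.8, 12.1, 10.9]
ration_2 = [9.1, 11.0, 9.9, 10.4, 10.1]
t_a, t_b = paired_t_two_ways(ration_1, ration_2)

# Example 9 from its summary figures: mean difference 1.42, s = 1.30, 10 pairs.
t_example_9 = (1.42 / 1.30) * math.sqrt(10 / 2)
print(round(t_a, 3), round(t_b, 3), round(t_example_9, 2))
```

The two values of s differ by a factor of the square root of 2, and the standard errors absorb that factor, so t is unchanged.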

7. Exercises. 

1. The figures below are for protein tests of the same variety of wheat grown in 
two districts. In district 1 the average for 5 samples is 12.74, and in district 2, the 
average for 7 samples is 13.03. If these are the only figures available, test the
significance of the difference between the average proteins for the two districts.

                 Protein Results
District 1:   12.6   13.4   11.9   12.8   13.0
District 2:   13.1   13.4   12.8   13.5   13.3   12.7   12.4

t = 1.04, P = 0.3, approximately.
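
For readers who wish to check their working on Exercise 1, the unpaired test can be sketched from the raw figures as follows. The function name is hypothetical. The third District 1 value is taken here as 11.9, the figure consistent with the stated mean of 12.74; that reading is an assumption.

```python
import math


def pooled_t_from_data(sample1, sample2):
    """Unpaired t from raw data, with the pooled degrees of freedom."""
    size1, size2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / size1
    mean2 = sum(sample2) / size2
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    df = (size1 - 1) + (size2 - 1)
    s = math.sqrt((ss1 + ss2) / df)
    # sqrt(1/N1 + 1/N2) is the same factor as in formula (2), written in sample sizes.
    se = s * math.sqrt(1 / size1 + 1 / size2)
    return (mean1 - mean2) / se, df


district_1 = [12.6, 13.4, 11.9, 12.8, 13.0]   # 11.9 assumed, see the note above
district_2 = [13.1, 13.4, 12.8, 13.5, 13.3, 12.7, 12.4]
t, df = pooled_t_from_data(district_1, district_2)
print(round(abs(t), 2), df)
```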


2. Mitchell (2) conducted a paired feeding experiment with pigs on the relative 
value of limestone and bonemeal for bone development. The results are given in
Table 11 below.


TABLE 11

Ash Content in Percentage of Scapulas of Pairs of Pigs
Fed on Limestone and Bonemeal

Pair     Limestone    Bonemeal
1          49.2         51.5
2          53.3         54.9
3          50.6         52.2
4          52.0         53.3
5          46.8         51.6
6          50.6         54.1
7          52.1         54.2
8          53.0         53.3
Mean       50.94        53.14


Determine the significance of the difference between the means in two ways: (1) by 
assuming that the values are paired, and (2) by assuming that the values are not 
paired. On the basis of your results, discuss the effect of pairing. 

(1) Paired: t = 4.42, P = less than 0.01.

(2) Unpaired: t = 2.48, P approximately 0.02.


3. In a wheat variety test conducted over a wide area, the mean difference 
between two varieties was found to be 4.5 bushels to the acre. The standard error
of this difference was 1.5 bushels per acre, and was determined from 100 pairs of
plots. Set up the fiducial limits at the 5% probability level for the mean difference
in yield between the two varieties.

Note that t can be taken as 1.96; the fiducial limits are then 1.56 to 7.44.



CHAPTER V 


THE DESIGN OF SIMPLE EXPERIMENTS 

1. What is Experimental Design? In Chapter I some ideas relative 
to experimental design were presented, but in view of what we have 
now learned of the t test it should be worth while at this point to repeat 
some of these ideas, and at the same time introduce any new concepts 
that have arisen out of later discussions. An experiment can be said to 
have a definite design if it has been carefully planned in advance, and if 
due attention has been paid to possible results and their interpretation. 
The latter point is probably the most frequently neglected. A great 
deal of time may be spent on the various details of procedure, and full 
preparations made for carrying the experiment through to completion. 
This may be assumed to be sufficient to ensure a successful experiment, 
but a long list of such experiments that contribute neither positive nor 
negative information is good evidence that careful planning of the pro- 
cedure is in itself incomplete. Only by thinking in terms of the various 
types of results that an experiment can yield is it possible to obviate 
some very costly mistakes. If these possibilities are thoroughly worked 
out it is self-evident that a complete failure is impossible. 

2. Planning to Remove Bias. One of the commonest mistakes in
experimental design is the failure to guard against biased results. Such 
experiments may give good results but their great weakness is that they 
are not beyond criticism; and regardless of the truth and importance of 
the results obtained the investigator may never feel quite happy about 
presenting them with conviction. Let us examine hypothetical plans of 
experiments that are subject to a bias of some sort. 

Suppose that we are to conduct an experiment on the value of feeding 
milk to school children. There are two neighboring schools, and milk is 
given to the children in one of the schools and not to those in the other. 
At the end of the experiment the children are compared on the basis 
of height, weight, etc., by means of the t test. The children from the 
school in which milk was given are found to be significantly heavier than
those from the other school. The error in design is so obvious here that
it is scarcely necessary to point it out. The experiment has shown that
the children of the two schools are significantly different in weight, but
this might easily have been the case if no milk had been given or even if
the order of giving the milk had been reversed. In fact the experiment
is not at all what it seemed to be at first. It consists actually of just 
two variates which are the two schools, and no determination of the 
error of such an experiment is possible. 

Now let us endeavor to improve the plan, and we will confine the 
giving of milk to pairs of boys or girls, one getting the milk and the other 
not. The pairs are selected at random, and in each pair the milk is 
given to the younger and not to the elder child. The reader will object 
that we are again introducing a bias in that the difference observed might 
easily be due to age and not to the effect of milk in the diet. This is 
perfectly true, so in order to overcome this defect we decide to give it to 
the younger child in one case and the elder child in the second case, 
alternating in this way throughout the entire group. Now the experi- 
ment seems to be perfect, and in truth it is much improved, but with a 
little thought it should be clear that we have succeeded in removing only 
the gross defects — those that are obvious to us at the outset and which 
anyone can remove with a little thought and a general knowledge of the 
problem being investigated. The chief trouble with our design is not 
that we have knowingly allowed some factor to bias the experiment, but
that we have not planned it in such a way that it is impossible for bias to
enter in. A definite method is available for this purpose, which has
already been referred to in Chapter I. It involves merely assigning at
random which member of each pair of children is to receive milk. This is
a simple device and one which is absolutely trustworthy in the matter
of removing bias.
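
Random assignment within pairs is easily mechanized. The sketch below is only an illustration: the function name, the labels for the children, and the use of a seeded generator (so that the plan can be reproduced) are all assumptions, and a table of random numbers or the toss of a coin would serve equally well.

```python
import random


def assign_within_pairs(pairs, rng):
    """For each pair, decide at random which member receives the milk."""
    plan = []
    for first, second in pairs:
        if rng.random() < 0.5:
            plan.append({"milk": first, "no_milk": second})
        else:
            plan.append({"milk": second, "no_milk": first})
    return plan


# Hypothetical children, identified only by labels.
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2"), ("D1", "D2")]
rng = random.Random(1)   # seeded only so that the illustration is repeatable
plan = assign_within_pairs(pairs, rng)
for row in plan:
    print(row)
```

The decision for each pair is made independently of age, weight, or any other characteristic of the children, which is what protects the comparison from bias.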

Numerous examples may be cited of experiments that are designed 
so that bias may enter in. One of the most common is the field plot test 
in which the varieties or treatments are arranged systematically in the 
blocks or replications. It is not possible to discuss this particular
problem and deal with it fully until we have made a study of the methods of
the analysis of variance, but we can consider the simple type of
experiment in which only two varieties or treatments are being tested and they
are arranged in pairs of plots. Here we are dealing with a series of
differences, and we set up a hypothesis as, for example, that the mean
difference is normally distributed about zero. On the basis of this
hypothesis we can determine the proportion of the trials in which a
difference as great as or greater than the one observed will occur. The
validity of our test depends on its being designed so that if the hypoth- 
esis is true the distribution of the results from a large number of trials 
will be normal and will have a mean of zero. What would anyone after 
a little thought say of an experiment designed so that, if the varieties 
being tested are actually equal in yield, the result turns out, according
to a large series of tests, either definitely positive or definitely negative?
Yet this is just the kind of result that may be expected if the principle 
of randomization is not used in setting up the experiment. This applies 
particularly to the position of the varieties or treatments in the pairs. 

3. Designs that Broaden the Scope of the Experiment. This is 
another subject that cannot be treated fully at this stage, but a few of
the general principles may be pointed out. Suppose that the all-inclu- 
sive subject of the experiment is the effect of milk in the diet of young 
animals. Most of us would reject this as a proper subject for experi- 
mental investigation at once, because we can see that it is one for which 
there is no possibility of obtaining a result that will be of practical value. 
In one group of animals the milk may be beneficial and in another group 
it may be of no value or even harmful, so that unless the experiment is 
repeated with all possible kinds of animals and the results with each 
kind studied separately we cannot expect to gather any valuable infor- 
mation. The decision with regard to an experiment of this type is likely 
to be that we should select one kind of animal in which we are particu- 
larly interested, and then confine the tests to a limited age group. In 
the first case the subject of the investigation called for an experiment of 
such enormous scope that the entire proposition was absurd. Now we 
have limited the scope of the experiment, but we have not gone as far 
as we might. Let us suppose that the investigator decides on pigs as 
the kind of animal to be tested, then he decides to use pigs of one age 
within the limits of one week, and finally that they shall be from the 
same litter. He has now gone to the other extreme and has set up an 
experiment such that, no matter how significant the results, they will 
not be of any value except within a very narrow range. It cannot be 
assumed that the results will apply to other age groups, to other breeds, 
or perhaps even to other litters, as it may easily be that the litter selected 
is peculiar in some respect with regard to the reaction of the individuals 
of the litter to milk in the diet. No amount of mathematical knowledge 
will help the investigator over the difficulty encountered here, of setting 
up an experiment that will not have too great a scope but will at the
same time give results that can be interpreted on a fairly wide basis. 
Only his own experience and general knowledge of the problem that is 
to be investigated will give the clue to the correct form for the experi- 
ment to take. In this instance there may be one breed of pigs that is 
predominant in the area in which the investigator is interested, and con- 
sequently it is quite justifiable to confine his experiment to this breed. 
Again, there will be a definite range in age at which farmers will be con- 
cerned with feeding milk, and only this range need be represented. It 
will not be wise, however, to use only pigs from one litter; in fact it
would seem to be desirable to have as many litters as possible represented
in order that the experimental material will be representative of pigs as 
a whole in the area in which they are being raised. An obviously de- 
sirable plan will be to take pairs of pigs of nearly equal weight and con- 
dition from a number of litters, assign the alternative diets at random 
to the members of each pair, and then feed the pigs individually so that 
individual records may be kept of food eaten and gains made. 

4. Replication and the Control of Error. The value of replication
in experimental design is easily understood. In the first place,
replication increases the accuracy and scope of the experiment; in the second
place, it enables us to determine the magnitude of the uncontrolled
variation that is usually referred to as the error; and in the third place it
allows for designs that give us an effective control over error. The in- 
crease in accuracy due to replication is expressible in terms of a mathe- 
matical equation. In Chapter II, Section 3, we noted that the standard 
deviation of a mean is reduced in proportion to the square root of the 
number in the sample. In ordinary experiments any one treatment is 
represented by a sample which is made up of one unit in each replication. 
Therefore in general the accuracy of an experiment, as expressed by the
standard error of a mean of any one treatment, is increased in proportion 
to the square root of the number of replications. This statement should 
not be interpreted to mean that results of twice the value are obtained 
by multiplying the replications by 4. This depends on what we mean 
by the value of the results. In terms of work done or energy expended 
on an experiment to bring about a given reduction in the standard error 
this is true, but it may be that the expenditure of additional energy in 
order to increase the accuracy of the experiment is unnecessary, in which 
case the value of the results is not enhanced. More will be said on this 
subject later; but for the present we should note that replication is the 
primary tool at our disposal for increasing the accuracy of the experi- 
mental results. 
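
The relation between replication and the standard error can be put in symbols. Here r, the number of replications, and s, the standard error of a single unit, are labels assumed only for this illustration:

```latex
s_{\bar{x}} = \frac{s}{\sqrt{r}}, \qquad
\frac{s}{\sqrt{4r}} = \frac{1}{2}\,\frac{s}{\sqrt{r}}
```

Fourfold replication therefore halves the standard error of a treatment mean, which is the sense in which the replications must be multiplied by 4 to double the accuracy.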

Another phase of the increased accuracy due to increased replication 
arises from the distribution of t for different degrees of freedom. From
Table 94 we note that, for 1 degree of freedom, t at the 5% point is
12.706 while for 60 degrees of freedom the corresponding value of t is
2.00. In the first case a much larger difference would be necessary to
represent a significant effect than in the second case. In a paired ex- 
periment the number of degrees of freedom available for estimating the 
error of the experiment is equal to 1 less than the number of pairs. Sup- 
pose then that we have one experiment with 3 pairs and another one 
with 10 pairs. For the first experiment we would require for significance 
a difference that is 4.30 times the error, and for the second experiment a
difference that is 2.26 times the error, these being the values of t at the
5% point for 2 and 9 degrees of freedom respectively. It is important
for the beginner to note carefully that this increase in accuracy due to
increased replication is entirely distinct from that discussed above which
results from dividing the standard error of the experiment by the square
root of the number of replications in order to determine the standard
error of a mean. Both factors act together and in the same direction
but they arise from different sources.
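
The joint action of the two factors may be seen in a small sketch. It assumes, purely for illustration, that the standard deviation of the differences within pairs is 1 unit in both experiments; the 5% points of t are those quoted above, and the function name is hypothetical.

```python
import math

# 5% points of t quoted in the text, keyed by degrees of freedom.
T_5_PERCENT = {2: 4.30, 9: 2.26}


def least_significant_difference(n_pairs, sd_of_differences):
    """Smallest mean difference reaching the 5% point in a paired experiment."""
    df = n_pairs - 1
    std_error = sd_of_differences / math.sqrt(n_pairs)   # first factor: division by sqrt(N)
    return T_5_PERCENT[df] * std_error                    # second factor: smaller t for more df


lsd_3 = least_significant_difference(3, 1.0)     # experiment with 3 pairs
lsd_10 = least_significant_difference(10, 1.0)   # experiment with 10 pairs
print(round(lsd_3, 2), round(lsd_10, 2))
```

On this assumption the experiment with 3 pairs needs a mean difference of about 2.5 units for significance, and the one with 10 pairs only about 0.7 units.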

The manner in which replication increases the scope of the experi- 
ment will be evident from the discussion of Section 3. In the example 
discussed there it was decided purposely to make the replications some- 
what different, in order that the results might be of general application. 
The importance of this is sometimes overlooked, and we will find field 
plot investigators looking for an exceptionally uniform patch of soil on 
which to carry out an experiment and putting all the replications on this 
same patch. No criticism is offered of this procedure provided that the
investigator is not under the impression that by doing so he is necessarily
improving the experiment. Within each replication it is desirable to
have as much uniformity as possible, but between the replications it
does not improve matters to have a great deal of uniformity; and from 
the standpoint of increasing the scope of the experiment it may even be 
harmful. To put these ideas into concrete form let us assume that two 
soil treatments are being compared in paired plots. On the field that
is available for the experiment there are several types of soils, and we
shall assume for the purpose of argument that all the soil types are
present that occur in the area for which the results of the experiment are
to apply. The investigator has three choices. The pairs of plots can
be placed all on one soil type, an equal number of pairs on each type, or
at random over the field. Placing the pairs all on one soil type and
close together in the field has in its favor compactness and economy of
space; but the results obtained on the one type of soil may not apply 
to the other types, and consequently to get full information on the 
problem a separate test must be planned for each condition. This 
may be beyond the scope of the facilities of the investigator, so he turns 
his attention to the other possibilities. Placing an equal number of 
pairs on each soil type has decided advantages. For example, if there 
are at least four pairs in each location it is possible to regard each set 
as an individual but very rough experiment, capable of yielding an 
approximate measure of the particular reaction of the two treatments 
on the soil type represented. The average yields of the two treatments
over the whole field will, however, be representative for the whole area
in which the results are to be put to practical use only if in that area
there are about an equal number of acres belonging to each type. This
statement, of course, implies that the treatments will give different 
results under the various substratum conditions, but experience tells us 
that this is very likely to be the case. We turn now to the third method, 
that of randomizing the pairs of plots over the whole field. The process 
of randomization will ensure that the various soil conditions represented 
in the field will have an equal chance of being used in the experiment. 
As nearly as possible, therefore, we are obtaining a random sample of 
the infinite population for which we are endeavoring to obtain an
unbiassed estimate of the difference between the two treatments. The
only possible criticism of this method is that some of the soil types will 
not be represented, and hence certain information will be lost. The 
answer is that with a given type of experiment we cannot perform two 
functions at once. Without enlarging it considerably we cannot design 
an experiment that will give us a general average result for the whole 
area under consideration, and at the same time give us detailed
information on the reactions of the treatments under varying conditions.
Information regarding the whole area is not lost, but gained, by placing
the pairs at random and perhaps missing some of the types. On the
assumption that the field is representative of the larger area being
sampled it gives us a more correct measure than if we assumed without
proper information that each of the types is equally represented.

This somewhat theoretical discussion does not bear precisely on the
practical problem with which the investigator is faced, because it is im- 
possible to obtain a field that is really representative of a large area. 
However, it serves to bring out some very important points that may 
be put into practice in tests of this kind. Any investigator who gives 
the problem serious thought will take note of the limitations of one test 
carried out under very uniform conditions, and at the same time will 
realize the importance of replication in widening the scope of field plot 
experiments. 

The second important function of replication is to enable us to obtain 
a measure of the experimental error. This follows directly from the 
principles of the t test. If there is only one plot of treatment A and 
one of B there can be only one difference, and the number of degrees of 
freedom available for estimating the standard error is zero. In non- 
statistical terms there is only one value, the difference between the two
plots, and this difference is the only measure we have of both soil
variation and the effect of the treatments. We cannot compare a difference
with itself; therefore, we say that there are no degrees of freedom
available for estimating the error of the difference. This defect in an
experiment is obviously overcome as soon as we introduce replication. Even
if we have only two plots of A and two of B we have at least one degree
of freedom available for estimating the error, and by means of the t
test an unbiassed comparison of the treatments can be made.

The third function of replication has to do with the control of error. 
Another hypothetical example will make this clear. Again we can sup- 
pose that two soil treatments are being compared in paired plots. The 
measure of error is determined from the variation in the differences
within the pairs. Suppose now that the plots are all distributed at 
random over the field, and the pairs are made up simply by taking the 
two plots of A and B that happen to fall together in another random 
selection. This can have only one effect, and that is to increase the 
variability of the differences, and consequently the accuracy of the test 
is reduced. A question that may be asked here is whether or not the 
method that increases the variability of the differences will also increase 
the average difference between the two treatments. Yes, the average
difference will also be increased but it must be remembered that this is
due in actual practice to two components. A part is due to the real 
difference between the treatments and a part to the variability of the soil. 
The latter component will be increased in the same proportion as the 
error, but the former will not, and consequently the precision of the 
experiment becomes correspondingly less as the error component 
increases. 
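
The effect of making up the pairs from plots scattered over the field can be illustrated by a small simulation. Everything in it is hypothetical: the field is given an arbitrary steady trend in fertility plus plot-to-plot noise, and the two treatments are taken to be truly equal, so that the variation in the differences is due to soil alone.

```python
import random
import statistics


def difference_variance(n_pairs, adjacent, rng):
    """Variance of within-pair differences when the treatments are truly equal."""
    n_plots = 2 * n_pairs
    # Hypothetical field: fertility rises steadily across the plots, plus random noise.
    yields = [0.5 * i + rng.gauss(0, 1) for i in range(n_plots)]
    order = list(range(n_plots))
    if not adjacent:
        rng.shuffle(order)            # pairs made up of plots scattered over the field
    diffs = []
    for k in range(n_pairs):
        i, j = order[2 * k], order[2 * k + 1]
        if rng.random() < 0.5:        # random assignment of A and B within the pair
            i, j = j, i
        diffs.append(yields[i] - yields[j])
    return statistics.variance(diffs)


rng = random.Random(0)   # seeded only so that the illustration is repeatable
var_adjacent = statistics.mean([difference_variance(20, True, rng) for _ in range(200)])
var_scattered = statistics.mean([difference_variance(20, False, rng) for _ in range(200)])
print(round(var_adjacent, 1), round(var_scattered, 1))
```

With neighbouring plots paired, the trend in fertility largely cancels within each pair; with scattered plots it enters the differences in full and the error is much enlarged.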

The benefits to be obtained from the arrangement of treatments in 
replications wherein each replication contains one of each of the
treatments are fairly well known to experimentalists, especially in agronomic
research. Variety trials are therefore arranged in compact blocks so 
that the plots within the blocks are as nearly alike as possible. There 
are, of course, many applications of the same principle in other types of 
experimentation; but this subject will be discussed more fully under the 
heading of the analysis of variance.



CHAPTER VI


LINEAR REGRESSION

1. General Observations. In the previous discussions emphasis was 
placed on the variations that occur in any one variable, such as the yield 
of wheat plots, the weight of animals, or the height of students. Some- 
times the values of one variable are classified in two or more ways, in 
which case we may be interested in the joint variation of the pairs or 
groups of values so formed. For example, in Chapter V a problem was 
discussed in which pairs of plots of two varieties were arranged in differ- 
ent ways over a field. The interest there was largely in the differences 
between the members of pairs, but it was also pointed out that if the 
plots were close together they would tend to yield alike, or in other 
words they would vary together. The present chapter, however, deals 
with examples wherein there are paired variates but of two different 
kinds of variables, and in general one of the variables may be regarded 
as independent and the other as dependent. In a study of the effect of 
rainfall on yields of field crops, we would have a typical example of a 
dependent and an independent variable, in that the interest would lie 
in the degree to which rainfall, acting as an independent variable, would
have an effect on yield, the dependent variable. It would be useless, 
of course, to think of this problem in any other terms, as we could not 
imagine the yield of field crops having any effect on rainfall. 

It is not difficult to see that, for any set of data for paired variates, 
it should be possible to obtain a measure of the physical relation between 
the two variables. Suppose that the data are arranged as in Fig. 6, 
which shows graphically the average yields of groups of plots of Marquis 
wheat for given percentages of infection with stem rust. It would not 
be difficult to draw a straight line so that it would represent the general 
trend of decreasing yield with increasing percentages of infection, and 
we could then read off the approximate decrease in yield for a given 
increase in infection. This, of course, would be a very crude method, 
as the fitting of the line would be purely a matter of eye judgment and
different individuals would place the line in slightly different places. 
Then to develop from the graph a general expression for the relation 
between the two variables, from which the line could be reconstructed 
at any time and which could be used for predicting the effect on yield
of given percentages of infection, it would be necessary to draw out the
graph very accurately and make an average of a number of
measurements. In order to arrive at a more precise method of fitting the line,
recourse is had to the "method of least squares." This means that
a line is fitted such that the sum of the squares of the deviations of the
points in the graph from the straight line is a minimum. It gives us
a statistic known as the regression coefficient, which expresses the
increase or decrease in the dependent variable for one unit of increase
in the independent variable. From the regression coefficient we can set
up a regression equation, which can be used to make predictions; and
also it defines the straight line known as the regression straight line.



Fig. 6. Regression graph for yields of Marquis wheat on degree of
infection with stem rust.

The essential difference between the treatment of different kinds of 
variables that are thought to be related and pairs of variables that 
merely vary together will now be clear. In the first case our concern is 
to determine a function, in the present case a straight-line function, that 
will express the average relation between the two variables. In the 
latter case the function will obviously not be of very much value; we
will probably be better satisfied with some expression giving the
combined effect of the variables on each other or perhaps, if we cannot think
in such terms, the degree to which both variables are acted upon by
outside influences that cause them to vary together. Of this second
condition we shall learn more in the next chapter.

2. Fitting the Regression Line. Let the two variables be represented
by x and y, where x is independent and y dependent. Then, if the
relation between x and y can be represented by a straight line, the
equation of the line will be of the form:

$$Y = a + bx \qquad (1)$$

where a and b are constants and Y represents the values of y estimated
from the equation. For any one value of x, say $x_i$, the corresponding
value of y estimated will be $Y_i$, and the error of estimation will be
$(y_i - Y_i)$. The value of $y_i$ would be represented on the graph as in
Fig. 6 by one of the points, and the corresponding estimated value $Y_i$
would be a point on the straight line. To fit the line, it is required that
the sum of the squares of the errors of estimation $\sum (y - Y)^2$ shall be a
minimum. It is best to begin with x and y measured from their means,
so that our regression line is actually:

$$(Y - \bar{y}) = a + b(x - \bar{x}) \qquad (2)$$

whence the error of estimation is given by
$\sum [(y - \bar{y}) - (Y - \bar{y})]^2 = \sum (y - Y)^2$, the same as before.
Minimizing by the method of least squares for $\sum (y - Y)^2$, we obtain
the equations:¹

$$Na + \sum (x - \bar{x})\, b = \sum (y - \bar{y})$$
$$\sum (x - \bar{x})\, a + \sum (x - \bar{x})^2\, b = \sum (y - \bar{y})(x - \bar{x})$$

and solving we have:

$$a = 0$$

$$b = \frac{\sum (y - \bar{y})(x - \bar{x})}{\sum (x - \bar{x})^2} \qquad (3)$$


In equation (3) we note the expression $\sum (y - \bar{y})(x - \bar{x})$, which is
usually referred to as the sum of products. For two variables, it is the
expression that corresponds to the sum of the squares of the deviations
from the mean for one variable. We know that the variance for a single
variable is given by:

$$\frac{\sum (x - \bar{x})^2}{N - 1}$$

and now we learn that the covariance for two variables is given by:

$$\frac{\sum (x - \bar{x})(y - \bar{y})}{N - 1} \qquad (4)$$

In (3) if the numerator and denominator are divided by $N - 1$ the
equation becomes:

$$b = \frac{\text{Covariance } (xy)}{\text{Variance } (x)} \qquad (5)$$

¹ For the method-of-least-squares technique see any good textbook on elementary
calculus. If it is confusing to apply these methods to expressions containing the
summation sign, $\sum$, write out one or two sets of values and proceed with them
consecutively. The procedure for the entire set of values summated will then be clear.

Going back now to (2) above:

$$Y - \bar{y} = b(x - \bar{x})$$

and:

$$Y = \bar{y} + b(x - \bar{x}) \qquad (6)$$

or:

$$Y = (\bar{y} - b\bar{x}) + bx \qquad (7)$$

the last being the form in which this expression is most frequently used.
It is known as the linear regression equation, and b in the equation is the
regression coefficient.
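
The fitting procedure of equations (3) and (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_line` and the four data pairs are invented here and do not come from the text.

```python
def fit_line(x, y):
    """Least-squares line Y = a + b*x, following equations (3) and (7)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of products and sum of squares of the deviations from the means.
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_squares_x = sum((xi - x_bar) ** 2 for xi in x)
    b = sum_products / sum_squares_x  # regression coefficient, equation (3)
    a = y_bar - b * x_bar             # constant term (y_bar - b*x_bar) of equation (7)
    return a, b


# Made-up points lying exactly on Y = 1 + 2x, so the fit should recover that line.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the invented points lie exactly on a straight line, every error of estimation is zero and the fitted constants reproduce the line itself.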

3. Properties of the Regression Coefficient. In the equation
$Y = \bar{y} + b(x - \bar{x})$, b expresses the probable relation between x and y
in terms of the values in which x and y are measured. The coefficient in
this equation is usually represented as $b_{yx}$, which means that it is the
regression coefficient for the regression of y on x, and thus in any sample
of paired variates studied it represents a kind of average of the increase in
y for a given increase in x. Thus if y is bushels per acre and x is tons of
fertilizer applied, $b_{yx}$ is an estimate of the increase in yield to be expected
from one ton of fertilizer.

For every example where we study the regression of y on x, there is 
also the theoretical possibility of studying the regression of x on y; but as
stated above the theory of linear regression is best confined to examples
where we can think clearly in terms of the effect of one variable on the
other, and consequently the investigator is concerned with only one
aspect of the regression. 

The regression coefficient is a measure of the slope of the regression 
line, but only relative to the class values of the two variables and their 
range of variation. Suppose that, in a study of the effect of rainfall on 
yield, the rainfall varies from 0 to 9 and the yields from 20 to 30, and 
the mean yield is 25 and the mean rainfall 5. In a graph such as Fig. 6 
the units could be of the same length for the two variables, and if the 
regression coefficient is 1 the regression line would go from one diagonal 
to the other and would have a slope of 1 ; that is, it would lie at an angle 
of 45 degrees. However, if rainfall varied from 0 to 20 the slope would 
be less than 1, even for the case where yield is completely dependent on 
rainfall. 

4. Tests of Significance of the Regression Coefficient. The sampling
error of the regression coefficient is related to the error of estimation
measured by $\sum (y - Y)^2$. Thus we have the standard error of estimate
given by:

$$s_e = \sqrt{\frac{\sum (y - Y)^2}{N - 2}} \qquad (8)$$

and the standard error of the regression coefficient by:

$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}} \qquad (9)$$

The value of $\sum (y - Y)^2$ can best be calculated by equating it to
$\sum (y - \bar{y})^2 - b^2 \sum (x - \bar{x})^2$; or
$\sum (y - \bar{y})^2 - b \sum (y - \bar{y})(x - \bar{x})$, depending
on which form is the more convenient at the time. In these equalities
it is understood that the regression coefficient is $b_{yx}$.

Then to make the test of significance t is given by:

$$t = \frac{b}{s_b} = \frac{b \sqrt{\sum (x - \bar{x})^2}}{s_e} \qquad (10)$$

and the table of t is entered under $N - 2$ degrees of freedom. There are
$N - 2$ degrees of freedom because both $\bar{y}$ and $b_{yx}$ are statistics calculated
from the sample.
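
As a hedged illustration of equations (8) to (10), the sketch below computes the standard error of estimate, the standard error of the regression coefficient, and t from raw pairs. The function name `regression_t_test` and the five data pairs are hypothetical and are not taken from the text; the resulting t would still have to be referred to a table of t under N - 2 degrees of freedom.

```python
import math


def regression_t_test(x, y):
    """Return (b, s_e, s_b, t, degrees_of_freedom) following equations (8) to (10)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sum_products / ss_x
    # Sum of squared errors of estimation: S(y - ybar)^2 - b * S(y - ybar)(x - xbar).
    residual_ss = ss_y - b * sum_products
    s_e = math.sqrt(residual_ss / (n - 2))  # equation (8)
    s_b = s_e / math.sqrt(ss_x)             # equation (9)
    t = b / s_b                             # equation (10)
    return b, s_e, s_b, t, n - 2


# Hypothetical data, for illustration only.
b, s_e, s_b, t, df = regression_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b, s_e, s_b, t, df)
```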

The test for the significance of the difference between two regression
coefficients is based on their respective standard errors. For the two
regression coefficients $b_1$ and $b_2$, with standard errors calculated as in (9)
above, the standard error of the difference would be:

$$s_{1-2} = \sqrt{s_{b_1}^2 + s_{b_2}^2} \qquad (11)$$

and

$$t = \frac{b_1 - b_2}{s_{1-2}} \qquad (12)$$

The two coefficients may be calculated from different numbers of paired
values, so that there would be a total of $(N_1 - 2) + (N_2 - 2)$ degrees
of freedom available for the comparison of the coefficients, where $N_1$
and $N_2$ are the numbers of pairs respectively from which $b_1$ and $b_2$ are
calculated.
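
A minimal sketch of this comparison follows. It assumes that the two coefficients come from independent samples and that the standard error of their difference is the square root of the sum of the two squared standard errors; that form is an assumption of the sketch. The function name `compare_slopes` and the numbers passed to it are hypothetical.

```python
import math


def compare_slopes(b1, s_b1, n1, b2, s_b2, n2):
    """t for the difference between two regression coefficients, with its degrees of freedom."""
    # ASSUMPTION: independent samples, so the squared standard errors add.
    s_diff = math.sqrt(s_b1 ** 2 + s_b2 ** 2)
    t = (b1 - b2) / s_diff
    degrees_of_freedom = (n1 - 2) + (n2 - 2)
    return t, degrees_of_freedom


# Hypothetical coefficients, standard errors, and numbers of pairs.
t, df = compare_slopes(1.2, 0.3, 12, 0.5, 0.4, 20)
print(t, df)
```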

A special case arises when there are two sets of values of the dependent
variable. If these are $y_1$ and $y_2$, there are two regression coefficients
$b_{y_1 x}$ and $b_{y_2 x}$; and it may be necessary to test the significance of the
difference between them. The simplest and most direct method is to
form a new variable from $(y_1 - y_2)$ and calculate $b_{(y_1 - y_2)x}$, which may
be tested in the ordinary way.

5. Methods of Calculation. It will be remembered from formula
(3) that the numerator of the regression coefficient is the sum of products
of the deviations from the means of the two variables, and is expressed
algebraically as $\sum (y - \bar{y})(x - \bar{x})$. The denominator of the coefficient
is the already familiar sum of squares of the deviations from the mean,
for the independent variable usually indicated by x. Our problem,
then, is to learn the most convenient method of calculating the sum of
products. The method follows from the identity:

$$\sum (y - \bar{y})(x - \bar{x}) = \sum (xy) - \frac{T_x T_y}{N} \qquad (13)$$

where $\sum (xy)$ is the sum of the products of the original values of x and
y, taken of course by pairs, and $T_x$ and $T_y$ are the totals for all the original
values of x and y, respectively. The latter are somewhat more convenient
symbols for the familiar $\sum (x)$ and $\sum (y)$. Given a series of paired
values, therefore, for which a regression coefficient is to be calculated,
the first step is to determine $T_x$ and $T_y$. Then each value of x is multiplied
by the corresponding value of y (or vice versa), and the sum of the products
accumulated in the machine. This gives us $\sum (xy)$, and if we subtract
from this $T_x T_y / N$, the remainder is the required sum of products of the
deviations. $\sum (x - \bar{x})^2$ is, of course, calculated in the manner indicated
in Chapter II.

In many examples the labor of calculation can be reduced by coding 
the data. This involves either subtracting a uniform quantity from the 
values of each individual variate or dividing by a constant quantity, or in 
certain cases both devices are employed at the same time. Supposing 
that the actual values are as given below on the left; the values on the
right are examples of how the coding may be carried out.

      Uncoded           Coded
      x      y         x      y

   2402   2785       240    278    Dividing by 10 and rounding off last figure.
                      40     78    Subtracting 200.

    198    196         8      6    Subtracting 190 from each value.
    195    193         5      3

    256    274        56     74    Subtracting 200. It is quite permissible to
    229    198        29     -2    have negative values, but usually they
                                   complicate the calculations slightly and if a
                                   machine is available for calculation most
                                   workers avoid them.

The regression coefficient having been calculated, the next step is to
determine the regression equation, $Y = (\bar{y} - b\bar{x}) + bx$. The portion
$(\bar{y} - b\bar{x})$ is constant and is computed once and for all. Putting the
result for this portion equal to a, we have the working equation:

$$Y = a + bx \qquad (14)$$

from which all the Y values that are necessary can be obtained.

It must be remembered that, if the regression equation is calculated 
from coded data, the resulting equation itself must be decoded before it 
can be used for prediction purposes. If the data have been coded by 
subtraction only, the only correction required is to the means of x and y 
and this correction must be made while the equation is in the form given 
in equation (7). If in the coding the x and y values are divided by a 
different constant value, then a correction must be made to the regression 
coefficient as well as to the means of x and y. For example, if x has been
divided by A and y by B, then the regression coefficient calculated from
the coded data must be multiplied by B/A.
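
The decoding rule can be checked numerically. The sketch below is an illustration only: the four raw pairs, the constants subtracted (2400 and 2780), and the divisors A = 10 and B = 5 are all hypothetical choices, and `slope` is a helper name introduced here.

```python
def slope(x, y):
    """Regression coefficient of y on x, computed from totals as in identity (13)."""
    n = len(x)
    t_x, t_y = sum(x), sum(y)
    sum_products = sum(xi * yi for xi, yi in zip(x, y)) - t_x * t_y / n
    sum_squares_x = sum(xi * xi for xi in x) - t_x * t_x / n
    return sum_products / sum_squares_x


# Hypothetical raw data (not from the text).
x_raw = [2400, 2410, 2430, 2460]
y_raw = [2780, 2790, 2800, 2830]

# Coding: subtract a constant from each variable, then divide x by A and y by B.
A, B = 10, 5
x_coded = [(v - 2400) / A for v in x_raw]
y_coded = [(v - 2780) / B for v in y_raw]

b_coded = slope(x_coded, y_coded)
b_decoded = b_coded * B / A      # correction for the two divisors
b_direct = slope(x_raw, y_raw)   # the same coefficient from the uncoded data

# Decode the means, then form the constant term (y_bar - b * x_bar) of equation (7).
n = len(x_raw)
x_bar = 2400 + A * sum(x_coded) / n
y_bar = 2780 + B * sum(y_coded) / n
a = y_bar - b_decoded * x_bar
print(b_coded, b_decoded, b_direct, a)
```

Subtracting a constant leaves the coefficient untouched and shifts only the means, which is why the correction for subtraction is applied to the means alone.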

Example 10. Calculation of the Regression Coefficient and Regression Equation
from a Small Series of Paired Values. In a hypothetical example the values
from 10 pairs of variates are as given below.

x . . .   9  8  7  7  6  5  3  3  1  1     $T_x = 50$
y . . .   9  9  8  6  6  5  4  3  1  1     $T_y = 52$

Values for the totals are given at the end of each line and $N = 10$. To find the sum
of products, and the sum of squares of x, we proceed as follows:

$\sum (xy) = (9 \times 9) + (8 \times 9) + (7 \times 8) + \cdots + (1 \times 1) = 335.0$
$T_x T_y / N = (50 \times 52)/10 = 260.0$
Difference $= \sum (y - \bar{y})(x - \bar{x}) = 75.0$

$\sum (x^2) = 9^2 + 8^2 + 7^2 + 7^2 + 6^2 + \cdots + 1^2 = 324.0$
$T_x^2 / N = 50^2/10 = 250.0$
Difference $= \sum (x - \bar{x})^2 = 74.0$

Then $b = 75.0/74.0 = 1.014$. $\bar{x} = 50/10 = 5.0$. $\bar{y} = 52.0/10 = 5.2$.

Also $a = (5.2 - 1.014 \times 5.0) = 0.13$.

Finally the regression equation is $Y = 0.13 + 1.014x$.

In order to use this equation for predicting values of y from given values of x,
it is only necessary to insert the required value for x and determine the resulting
value of Y. For example, if we take x equal to 2 the calculated value of Y is
$0.13 + 1.014 \times 2 = 2.158$.
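
The arithmetic of Example 10 can be retraced with a short Python sketch. The x values are taken as 9, 8, 7, 7, 6, 5, 3, 3, 1, 1, the series that reproduces the stated totals; the variable names and the helper `predict` are introduced here for illustration. Because the sketch carries b unrounded, its prediction at x = 2 differs slightly in the third decimal from the hand calculation, which uses the rounded constants 0.13 and 1.014.

```python
x = [9, 8, 7, 7, 6, 5, 3, 3, 1, 1]
y = [9, 9, 8, 6, 6, 5, 4, 3, 1, 1]

n = len(x)
t_x, t_y = sum(x), sum(y)                      # totals T_x and T_y
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # sum of products of original values
sum_x2 = sum(xi * xi for xi in x)              # sum of squares of original x values

sum_products = sum_xy - t_x * t_y / n          # S(y - ybar)(x - xbar), identity (13)
sum_squares_x = sum_x2 - t_x ** 2 / n          # S(x - xbar)^2

b = sum_products / sum_squares_x               # regression coefficient
a = t_y / n - b * t_x / n                      # constant term, ybar - b * xbar


def predict(x_value):
    """Estimated Y for a given x from the working equation Y = a + b*x."""
    return a + b * x_value


print(sum_products, sum_squares_x, round(b, 3), round(a, 2), predict(2))
```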

Example 11. Calculation of the Regression Coefficient and Regression Equation
from a Large Series of Paired Values. When dealing with large numbers of
variates, we found that it was convenient to make up a frequency table in order to
summarize the data and reduce the labor of calculating the mean and the standard
deviation. Similarly, in regression studies, if a large series of paired values is
available it is desirable to make up a table which is a combination of the frequency
distributions of the two variables. From long usage such a table has become known as
a correlation table, and we shall see in the next chapter that it is likewise of value for
calculating the correlation coefficient.

To prepare a correlation table the best plan is to copy the paired values on cards
of a size that can be handled conveniently. Thus, if we decided to make up a table
for the yields of plots in adjacent rows of Table 4, Chapter II, we would make our
cards as follows:

      First card          Second card

      x   185             x   169
      y   162             y   205

and proceed until all the pairs had been entered. After deciding on the class values
in very much the same manner as described in Chapter II, Section 5, we would
distribute the cards for one of the variables and then distribute each pile for the second
variable. Table 12 is the final result of distributing all the cards for the yields of
adjacent plots as taken from Table 4. The classes here are somewhat larger than
they should be, in order to save space and to make the table more convenient to use
as an example. The cards were first distributed for x, giving the frequency
distribution as shown in the last row of the table. The 4 cards falling in the first class were
then distributed in the vertical column according to the values of y, and so on for
each pile. When all the piles were distributed, the cards in each small pile were
counted, and the frequencies entered in the table. Notice also that the natural
numbers have been inserted in the table to replace the class values. This is the device
introduced in Chapter II for reducing the labor of calculating the mean and standard
deviation from frequency tables. It may be used here in the same way, in order to
reduce the labor of calculating the regression coefficient. It will be noted that this is a
form of coding, and consequently the regression coefficient and the regression
equation will require correction if they are calculated from a table of this kind.

The next step is to prepare Table 13, in which the first four columns are entered
directly from the correlation table. For the column headed "totals for y arrays"
we proceed to obtain the totals for each array as follows, where the first array of y is
the distribution in the y classes of the variates that fall in the first class for x.

1st array   (2 × 3) + (1 × 6) + (1 × 8) = 20
2nd array   (2 × 3) + (4 × 4) + (5 × 5) + (1 × 6) + (1 × 7) = 60

The total for this column is obviously $T_y$, the grand total of y. In the same way we
proceed to obtain the totals for the x arrays and $T_x$, the grand total of x. There are
two columns headed $\sum (xy)$, the object being to calculate $\sum (xy)$ in two ways so as to
have a complete check on the calculations. The entries in these columns are obtained
by multiplying the totals for the y arrays by the corresponding class values of x, and
the totals for the x arrays by the corresponding class values of y. Summating at the
foot of the columns we obtain $\sum (xy)$.

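Finally from the correlation table we have to calculate $\sum (x^2)$, and the method is
the same as in Chapter II for any frequency distribution. Tabulating our
calculations we have:

$\sum (xy) = 5448$          $T_x = 850$
$\sum (x^2) = 3952$         $T_y = 1246$
$N = 200$

$T_x T_y / N = (850 \times 1246)/200 = 5295.5$
$\sum (y - \bar{y})(x - \bar{x}) = 5448 - 5295.5 = 152.5$
$T_x^2 / N = 850^2/200 = 3612.5$
$\sum (x - \bar{x})^2 = 3952 - 3612.5 = 339.5$
$b_{yx} = 152.5/339.5 = 0.4492$

TABLE 13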


Calculation of the Regression Coefficient

  y   Frequency    x   Frequency   Totals for    Σ(xy)   Totals for    Σ(xy)
                                    y Arrays              x Arrays

  1       1        1       4           20           20         4           4
  2       1        2      13           60          120         3           6
  3       6        3      43          243          729        12          36
  4      14        4      48          307         1228        48         192
  5      34        5      59          387         1935       132         660
  6      53        6      27          187         1122       228        1368
  7      55        7       6           42          294       247        1729
  8      30                                                  143        1144
  9       4                                                   21         189
 10       2                                                   12         120

        200 (N)                                   5448 Σ(xy)            5448 Σ(xy)


In order to set up the regression equation, the means of x and y are required. These
are $\bar{x} = 850/200 = 4.25$, and $\bar{y} = 1246/200 = 6.23$, and the regression equation is
written:

$$Y = (6.23 - 0.4492 \times 4.25) + 0.4492x$$
$$Y = 4.3209 + 0.4492x$$

Since the regression equation has been calculated from coded values, the necessary
corrections must be applied. To correct the means we apply formula (7), Chapter II,
obtaining:

$$\bar{y} = (6.23 - 1) \times 23 + 31 = 151.29$$
$$\bar{x} = (4.25 - 1) \times 23 + 77 = 151.75$$

Since the class value is 23 for both variables, the regression coefficient does not
require any correction, so the new equation is:

$$Y = (151.29 - 0.4492 \times 151.75) + 0.4492x$$
$$Y = 83.12 + 0.4492x$$

In order to plot the regression straight line, we require only two points on the graph,
preferably as far apart as possible. It is simpler to use the coded regression equation
to find any values of Y required, and also the graphing may be done in the coded
values and the actual values inserted when everything is completed. The end
points of the line are

$$Y_1 = 4.3209 + 0.4492 \times 1 = 4.77$$
$$Y_7 = 4.3209 + 0.4492 \times 7 = 7.46$$

The graph is finally as in Fig. 7. If such a graph is required in the final presentation
of the results, it would be necessary only to substitute the actual class values for the
assumed values. The means of the y arrays are, of course, obtained by dividing the
totals for the y arrays by the corresponding frequencies. These may be converted
to actual values by means of the formula for correcting means as described in
Chapter II, and used above for finding the true values of $\bar{x}$ and $\bar{y}$.

Fig. 7. Regression graph for yields of adjacent plots showing regression
line and means of y arrays. (Horizontal axis: yield, coded classes 1 to 7. Means
of y arrays for classes 1 to 7: 5.00, 4.62, 5.65, 6.40, 6.56, 6.93, 7.00.)
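To test the significance of the regression coefficient we find

$$s_e = \sqrt{\frac{\sum (y - \bar{y})^2 - b^2 \sum (x - \bar{x})^2}{N - 2}} = \sqrt{\frac{417.42 - 0.4492^2 \times 339.50}{198}} = 1.3275$$

$$t = \frac{b \sqrt{\sum (x - \bar{x})^2}}{s_e} = \frac{0.4492 \times \sqrt{339.50}}{1.3275} = 6.23,$$

from which it is clear that the regression coefficient is highly significant.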
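
The whole of Example 11 can be retraced from the columns of Table 13. The sketch below is a hedged illustration in coded units: it uses the x frequencies and the totals for the y arrays as read from the table, and it takes the value 417.42 for the sum of squares of y directly from the text rather than recomputing it. The variable names are introduced here.

```python
import math

# Coded class values of x with their frequencies and the totals for the y arrays.
x_class = [1, 2, 3, 4, 5, 6, 7]
freq_x = [4, 13, 43, 48, 59, 27, 6]
y_array_totals = [20, 60, 243, 307, 387, 187, 42]
ss_y = 417.42  # S(y - ybar)^2 in coded units, as quoted in the text

n = sum(freq_x)
t_x = sum(f * c for f, c in zip(freq_x, x_class))
t_y = sum(y_array_totals)
sum_xy = sum(c * total for c, total in zip(x_class, y_array_totals))
sum_x2 = sum(f * c * c for f, c in zip(freq_x, x_class))

sum_products = sum_xy - t_x * t_y / n  # S(y - ybar)(x - xbar)
ss_x = sum_x2 - t_x ** 2 / n           # S(x - xbar)^2

b = sum_products / ss_x                # regression coefficient (coded)
a = t_y / n - b * t_x / n              # constant term of the coded equation
s_e = math.sqrt((ss_y - b * b * ss_x) / (n - 2))  # standard error of estimate
t = b * math.sqrt(ss_x) / s_e                     # test of significance

print(n, t_x, t_y, sum_xy, sum_x2)
print(round(b, 4), round(a, 4), round(s_e, 4), round(t, 2))
```

Multiplying each y-array total by its x class value gives one of the two Σ(xy) columns, so this also serves as a check on that column of the table.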


6. Exercises. 

1. Table 14 gives the results obtained in an experiment with 25 wheat varieties
on the number of days from seeding to heading and the number of days from seeding
to maturity. Calculate the regression equation for the regression of days to mature
on days to head, and test the significance of the regression coefficient. Code the
data before beginning your calculations by subtracting 50 from the days to head
and 85 from the days to mature. Find the fiducial limits at the 5% point of the
regression coefficient, and decide as to the practicability of using days to head to
replace days to mature on the basis of the data provided by this sample.

Regression coefficient = 105.23/125.68 = 0.8373. (Coded data.)

TABLE 14


Data on Days to Head and Days to Mature of 25 Wheat Varieties

Variety   Days to   Days to        Variety   Days to   Days to
           Head      Mature                   Head      Mature

    1      60.0      94.4            14       58.2      92.4
    2      53.6      89.0            15       58.0      91.6
    3      59.0      94.0            16       59.4      94.0
    4      61.8      95.4            17       55.4      90.8
    5      53.8      88.2            18       61.6      95.2
    6      57.8      93.4            19       63.0      97.2
    7      57.8      93.6            20       60.2      94.6
    8      58.4      92.0            21       61.6      96.0
    9      57.8      92.8            22       57.6      92.6
   10      59.0      93.4            23       60.8      95.4
   11      59.2      93.8            24       61.2      94.4
   12      59.0      92.8            25       58.2      94.0
   13      58.6      94.2




2. Table 15 contains data on the carotene content determined by two methods for
139 wheat varieties. By one method carotene was determined on the whole wheat,
and by the other method, on the flour. The figures for carotene in the wheat are
lower than for carotene in the flour, which is of course the reverse of the actual
condition. This was due to a different method of extraction used for the whole
wheat which gave lower but relative results.

Make out cards, one for each pair of values, and prepare a correlation table,
letting the flour carotene represent the dependent variable y. In order to reduce
the labor of calculation make the classes fairly large; for example, let the first class
for x be 0.85 to 0.95, and the first class for y be 1.33 to 1.49. Also do not forget to
replace the actual class values by the natural numbers, beginning at 1, before going
ahead with the calculations. Determine the regression equation and prepare a
graph similar to Fig. 7. $b_{yx} = 438.39/665.98 = 0.6583$. (Coded data.)

3. Prove: (a) $\sum (y - \bar{y})(x - \bar{x}) = \sum (xy) - T_x T_y / N$,

(b) $\sum (y - Y)^2 = \sum (y - \bar{y})^2 - b^2 \sum (x - \bar{x})^2$.

TABLE 15


Carotene Content of Flour and Whole Wheat for 139 Varieties

(Entries marked ? are illegible in the source.)

Variety  Carotene  Carotene    Variety  Carotene  Carotene    Variety  Carotene  Carotene
  No.    in Flour  in Wheat      No.    in Flour  in Wheat      No.    in Flour  in Wheat

   1       2.39      1.18         48      1.71      1.16         95      1.97      1.33
   2       3.11      2.13         49      1.93      1.14         96      1.83      1.14
   3       2.15      1.41         50      1.81      1.30         97      2.00      1.51
   4       1.96      1.42         51      1.89      1.32         98      1.96      1.28
   5       2.02      1.50         52      1.65      1.32         99      2.00      1.33
   6       1.76      1.25         53      1.93      1.28        100      2.02      ?.32
   7       2.10      1.65         54      2.12      1.48        101      1.78      ?.17
   8       2.12      1.24         55      2.25      1.50        102      1.83      1.10
   9       2.28      1.48         56      1.92      1.42        103      1.03      1.22
  10       1.86      1.35         57      2.25      1.66        104      2.14      1.44
  11       2.60      1.58         58      2.25      1.63        105      2.15      1.54
  12       2.11      1.45         59      1.65      1.18        106      2.13      1.46
  13       2.30      1.74         60      1.63      1.14        107      1.97      1.40
  14       1.80      1.42         61      1.70      1.22        108      1.83      1.11
  15       2.00      1.45         62      1.61      1.20        109      2.10      1.40
  16       2.05      1.87         63      1.83      1.33        110      1.84      1.19
  17       2.09      2.00         64      1.60      1.13        111      1.98      1.39
  18       2.33      1.65         65      1.37      ?.92        112      2.31      1.60
  19       2.29      1.64         66      1.96      1.??        113      2.29      1.53
  20       2.30      1.62         67      1.88      1.26        114      2.15      1.45
  21       1.97      1.55         68      1.92      1.34        115      1.96      1.44
  22       2.36      1.68         69      1.89      1.??        116      1.08      1.40
  23       1.73      1.32         70      1.99      1.26        117      1.80      1.30
  24       1.72      1.47         71      1.82      ?.98        118      ?         1.33
  25       1.70      1.53         72      2.12      1.31        119      ?         1.42
  26       1.63      1.50         73      2.16      1.16        120      2.06      1.44
  27       1.93      1.48         74      2.14      1.04        121      1.96      1.36
  28       1.50      1.25         75      1.63      ?.88        122      2.07      1.38
  29       1.77      1.33         76      2.76      1.91        123      2.24      1.51
  30       1.60      1.40         77      2.07      1.20        124      2.15      1.38
  31       2.31      1.49         78      1.67      1.07        125      1.83      1.18
  32       2.17      1.42         79      2.78      1.80        126      1.84      1.20
  33       2.10      1.35         80      3.40      2.02        127      2.03      1.45
  34       2.90      1.58         81      3.67      2.10        128      1.87      1.05
  35       2.17      1.50         82      2.41      1.61        129      2.24      1.44
  36       2.15      1.40         83      2.23      1.38        130      2.14      1.06
  37       2.01      1.40         84      3.07      1.93        131      2.13      1.10
  38       2.35      1.67         85      2.22      1.44        132      2.03      ?.98
  39       2.34      1.62         86      2.55      1.58        133      2.25      1.31
  40       2.00      1.47         87      2.12      1.39        134      2.33      1.08
  41       2.18      1.55         88      1.94      1.27        135      2.01      1.14
  42       2.47      1.73         89      1.95      1.41        136      1.89      1.41
  43       2.25      1.62         90      1.59      1.??        137      3.00      2.20
  44       1.77      1.39         91      2.00      1.30        138      2.16      1.73
  45       1.68      1.34         92      1.77      1.22        139      2.29      1.61
  46       2.46      1.29         93      1.98      1.26
  47       1.86      1.28         94      1.07      1.30


















CHAPTER VII 


CORRELATION 

1. Covariation. This is a term that is very expressive with respect
to the fundamental situation regarding two variables, from which the
methods of correlation arise. In the previous chapter it was pointed
out that, when two variables are so related that one may logically be
considered as being dependent on the other one, the method of regression
is completely applicable to a study of this relation; but when the two
variables cannot be considered in the light of dependence and
independence, the method of regression does not appear to be satisfactory.
Suppose that a study is to be made of the relation between the heights of
brothers and sisters. It would not be logical to consider the height of
one member of the pair as being dependent on the height of the other
one, yet we may be fairly certain that there is a relation of some sort
and we may wish to estimate what this relation is. The question that is
asked with respect to two such variables seems to be this: "To what
extent do the heights of brother and sister vary together?" Thus we
have the term covariation, and the conventional statistic for the
measurement of covariation is the correlation coefficient.

2. Definition of Correlation. In Table 16 there are three sets of 
figures that may be taken as measurements on two variables that we 
shall designate as x and y. On examining these three sets of values it 
will be noted that the relation between x and y is different in each case. 
In set 2 we have high values of x associated with high values of y, and 
in set 3 we have high values of x associated with low values of y. In 
both cases there is an obvious relation but one is the reverse of the other. 
In set 1, on the other hand, there is no apparent relation between the 
two variables. These sets may be regarded as samples from infinite 
parent populations of paired variates. In the population from which 
set 2 is drawn, whenever a pair of variates is selected, we expect to find, 
if the pair contains a high value of x, that there will be a high value of y 
associated with it. In the population represented by the sample in 
set 3 it is to be expected that high values of x will be found associated 
with low values of y. These two opposite situations are referred to as 
positive and negative correlation. Set 1 represents still another
situation. High values of x do not appear to be associated with either high
or low values of y. In other words, we shall expect that in the parent
population the two variables vary independently. A graphical picture
of the results with these three samples is given in Fig. 8. For each
sample we have prepared what is usually known as a dot diagram. The
values of y are represented as ordinates and the values of x as abscissae,
so that each pair can be represented by a dot on the diagram. The final
result is a figure which represents in a general way, by the scatter of the
dots, the relation between the two variables. For set 1 the dots are
scattered more or less uniformly over the whole surface. For sets 2
and 3 there is a definite relation between the variables, as shown by the
tendency for the dots to arrange themselves in a straight line along the
diagonals of the square. We are reminded here of the regression graphs
of the previous chapter. The difference is that we are not now studying
the effect of one variable on the other, but rather the degree to which
the variables vary together owing presumably to influences that are
common to both. If such measurements represented heights of brothers
and sisters, it is apparent that this common influence might be the
similarity of their genes.

TABLE 16

Three Samples of Paired Variates Illustrating the
Phenomenon of Correlation

Set 1   x . . .   7  7  1  6  5  3  8  9  3  1     Total = 50
        y . . .   5  9  6  1  3  1  9  4  6  8     Total = 52

Set 2   x . . .   9  8  7  7  6  5  3  3  1  1     Total = 50
        y . . .   9  9  8  6  6  5  4  3  1  1     Total = 52

Set 3   x . . .   1  1  3  3  5  6  7  7  8  9     Total = 50
        y . . .   9  9  8  6  6  5  4  3  1  1     Total = 52

Fig. 8. Dot diagrams for the sets of values given in Table 16.

This rough illustration is sufficient to give a general idea of the nature
of correlation, but it is not adequate to give a complete picture of
correlation as it occurs in nature. The student who is specially interested
in this subject should make a thorough study of the references given at
the end of this chapter. Each writer on this subject presents the
situation in a somewhat different manner, and after a study of several
viewpoints the student will begin to grasp the fundamental points very
clearly. We are concerned here mainly with the viewpoint that
correlation is a measure of the degree to which two variables vary together,
as we believe this to be the most useful viewpoint from the standpoint
of the research worker. Since we have become acquainted with the
variance and the standard deviation as measures of variability, it is of
interest now to inquire how the combined variation of two variables can
be measured, and how much of the variability of one variable is tied up
with the variability of some other variable. In the first place, however,
we must consider a few points that are fundamental to the methods of
measurement that will be employed.

The dot diagrams given in Fig. 8 will result from combining the
frequency distributions of two variables. Since they represent samples
only, they give merely an estimate of the combined frequency
distributions of the two variables in the parent populations. The single or
univariate distributions are represented by a curve, but the combined or
bivariate distributions must be represented by a surface. On extending
the diagrams of Fig. 8 to much larger samples it is evident that the dots
will begin to form into swarms of some definite shape, depending on the
degree of correlation between the variables. If the correlation is high
the swarm will evidently be of the greatest density along the diagonal
of the figure; if there is no correlation the swarm is likely to be almost
circular in shape. The theoretical bivariate frequency distribution will
obviously be represented by a volume, in contradistinction to that of
the univariate distribution which is represented by an area. These
points give us some clue as to how we may obtain a measure of
correlation.

3. The Measurement of Correlation. Figure 9 illustrates the shape
of the swarm in a correlation surface for three different degrees of
correlation. The circular swarm at (a) represents zero correlation. In (c)
the swarm falls entirely on the diagonal and must represent perfect
correlation. In (b) we have a condition between the other two extremes.
Now each surface is divided into quadrants by lines erected at the
positions of the means, and in each quadrant are plus and minus signs that
represent the signs of the products of the x and y deviations from their
means. Thus in the upper right-hand quadrant (1) the deviations of
x and y are both positive so that the product of the deviations is positive;
in quadrant (3) both deviations are negative, so that the product is
again positive. Therefore we have positive products in quadrants (1)
and (3) and negative products in quadrants (2) and (4). Now if we obtain
the sum of the products it is obvious that in (a) the plus and minus
products will cancel each other and the sum will be zero. In (c) all the
products will be positive so that their sum will be a maximum. In (b) the
condition is intermediate between (a) and (c). The plus products are
greater than the negative products; hence we have a positive but not a
perfect correlation.

Fig. 9. Correlation surfaces (panels A, B, C) showing the variation in the
shape of the swarm with increasing correlation.

Let us consider now the sets of figures in Table 16. If we calculate
the sum of the products $\sum (x - \bar{x})(y - \bar{y})$ for each set we should find an
agreement with the theory outlined above. To carry out these
calculations we shall make use of the identity:

$$\sum (x - \bar{x})(y - \bar{y}) = \sum (xy) - \frac{T_x T_y}{N} \qquad (1)$$

where $T_x$ is the total of the x values, $T_y$ the total of the y values, and N is
the number of pairs. Our calculations then come out as follows:

            Σ(xy)     TxTy/N     Σ(x - x̄)(y - ȳ)

Set 1        262        260             2
Set 2        335        260            75
Set 3        186        260           -74

The result is in perfect agreement with the theory that the sum of
products is a measure of correlation.

The sum of products is an absolute measure of correlation but will
not serve as a relative measure, since it is dependent on several factors 
that have nothing to do with the correlation between the two variables 
with which we are concerned. It depends on the number of pairs of 
measurements or variates, on the units in which the two sets of variates 
are measured, and on the variability of both of the variables. The first 
objection can be overcome by dividing by the number of pairs of vari- 
ates, and we now find that we have 2(x — 2)(y — which was 

defined in the previous chapter as the covariance cv of x and y. The 
covariance, however, is still not a relative measure of correlation, as it is 
affected by the units of measurement and the variability of x and y. To 
overcome this difficulty it is clear that the covariance must be divided by 
some factor which measures the variability of x and y and is expressible 
in terms of the units in which these variables are measured. The first 
factor which suggests itself is the product of the two standard deviations, 
and this actually gives the formula for the correlation coefficient, usually 
designated by the symbol r. Thus we have: 

. 2(* - i)(y - g)/N 


Another formula can be given using the variances of x and y in place of 
their standard deviations. This must of course be: 


2(» - f)(y - g)/N 


(3) 


where s. is the variance of x and tv is the variance of y. Formula (3) 
shows also the algebraic relationship between the regression coefficient 
bta and the correlation coefficient. Since: 


it follows that: 



and 





and — is obviously the regression coefficient where x is taken as the 

Vy 

dependent variable instead of y. Of course in all regression problems 
there are two regression coefficients, although, in the type of problem we 
have referred to in the chapter on regression, one of these will be of 
theoretical interest only. The correlation coefficient is finally: 


r 


(4) 
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In other words, it is merely the geometric mean of the two regresaon 
coefficients. 
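
The relation between formula (3) and the geometric mean of the two regression coefficients can be checked numerically. The sketch below is a minimal illustration; the data and the function name are hypothetical, and the sign of r is taken from the covariance, since the square root alone does not carry it.

```python
import math


def correlation_two_ways(xs, ys):
    """Return r from formula (3) and from the geometric mean of b_yx and b_xy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cv = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    v_x = sum((x - mean_x) ** 2 for x in xs) / n
    v_y = sum((y - mean_y) ** 2 for y in ys) / n

    r_direct = cv / math.sqrt(v_x * v_y)

    b_yx = cv / v_x  # regression of y on x
    b_xy = cv / v_y  # regression of x on y
    # Both coefficients share the sign of the covariance, so r takes that sign.
    r_geometric = math.copysign(math.sqrt(b_yx * b_xy), cv)
    return r_direct, r_geometric


# Hypothetical paired values (assumed for illustration).
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
print(correlation_two_ways(xs, ys))
```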

A brief inspection of the formula of the correlation coefficient will
show that it has a maximum value of +1 and a minimum value of −1
under conditions that we would ordinarily take to represent perfect
correlation.

(1) Let $y_i = kx_i$, where $y_i$ and $x_i$ represent any pair of
values of y and x, and k is a positive constant. We have therefore a
constant, positive relationship between x and y.

Then

$$(y_i - \bar{y}) = (kx_i - k\bar{x}) = k(x_i - \bar{x})$$

and

$$(y_i - \bar{y})(x_i - \bar{x}) = k(x_i - \bar{x})^2$$

Hence

$$\Sigma(x - \bar{x})(y - \bar{y}) = k\Sigma(x - \bar{x})^2$$

Also

$$\Sigma(y - \bar{y})^2 = k^2\Sigma(x - \bar{x})^2$$

Therefore

$$v_y = k^2 v_x$$

And finally

$$r = \frac{\Sigma(x - \bar{x})(y - \bar{y})/N}{\sqrt{v_x v_y}} = \frac{k\Sigma(x - \bar{x})^2/N}{k v_x} = +1$$

(2) Let $y_i = -kx_i$. Here we have a constant negative relationship
between x and y.

Then

$$(y_i - \bar{y}) = (-kx_i + k\bar{x}) = -k(x_i - \bar{x})$$

and

$$(y_i - \bar{y})(x_i - \bar{x}) = -k(x_i - \bar{x})^2$$

Hence

$$\Sigma(x - \bar{x})(y - \bar{y}) = -k\Sigma(x - \bar{x})^2$$

Also

$$\Sigma(y - \bar{y})^2 = k^2\Sigma(x - \bar{x})^2$$

Therefore

$$v_y = k^2 v_x$$

Finally

$$r = \frac{\Sigma(x - \bar{x})(y - \bar{y})/N}{\sqrt{v_x v_y}} = \frac{-k\Sigma(x - \bar{x})^2/N}{k v_x} = -1$$

These two conditions that we have postulated are those for which we
should expect a satisfactory coefficient to give us a maximum value of
+1 and a minimum value of −1. Between these two extremes we
should expect the coefficient to give us values varying between +1 and
−1, and this is what it actually does. Our proof as given above indicates
this also, but it is not a rigid proof in that particular respect.

Having satisfied ourselves that when we have perfect positive correlation
the coefficient will be +1, and when we have perfect negative
correlation the coefficient will be −1, it remains to decide how the
coefficient will measure correlations that fall within this range. As a
matter of fact it is easy to state this proposition, but quite difficult to
explain it in a simple and satisfactory manner. Perhaps the best
interpretation arises from considerations that actually are more closely
related to the theory of linear regression than that of correlation. For
example, if we take y to be the dependent variable, then we can work
out the relation between the correlation coefficient and the two
variances, the total for y, and the variance of the errors of estimation. As
pointed out in the previous chapter, the sum of squares of the errors of
estimation is $\Sigma(y - Y)^2$, where Y represents points on the regression
straight line corresponding to each value of y. The variance of the
errors of estimation is therefore given by:

$$v_e = \frac{\Sigma(y - Y)^2}{N - 2} \qquad (5)$$

Now the variance of y is related to the above variance in the manner
indicated by the following equations:

$$v_y = \frac{\Sigma(y - \bar{y})^2}{N - 1} \qquad (6)$$

$$v_e = \frac{(1 - r^2)\Sigma(y - \bar{y})^2}{N - 2} \qquad (7)$$

From which it follows that the ratio of the two variances is:

$$\frac{v_e}{v_y} = (1 - r^2)\frac{N - 1}{N - 2} \qquad (8)$$

On the same basis, if we examine the relation between $v_y$ and the variance
due to the regression function, the latter being given by:

$$v_b = \frac{b_{yx}^2\Sigma(x - \bar{x})^2}{1} \quad \text{or} \quad r^2\Sigma(y - \bar{y})^2 \qquad (9)$$

we find that:

$$\frac{v_b}{v_y} = r^2(N - 1) \qquad (10)$$
Finally, the ratio $v_b/v_e$ is given by:

$$\frac{v_b}{v_e} = \frac{r^2}{1 - r^2}(N - 2) \qquad (11)$$

The variance $v_e$ is frequently taken as representing that portion of the
variation in y which is independent of x; hence we note that from this
standpoint equation (8) is the most important. If $v_e$ is expressed in
percentage of $v_y$, then it is clear from (8) that this percentage is almost
proportional to $(1 - r^2)$. This is another way of expressing the commonly
known fact that differences between high correlation coefficients are
much more significant than similar differences between small correlation
coefficients. As a measuring stick for general use it is therefore
much more convenient to think in terms of $r^2$ than in terms of r. For
example, if we have a correlation coefficient of 0.5, the ratio $v_e/v_y$ is
approximately 0.75, and the ratio does not fall to 0.5 until r reaches about 0.71.
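
Equation (8) is easily tabulated. The short Python sketch below evaluates the ratio of the error variance to the total variance for a few values of r; the function name and the sample size of 100 pairs are assumptions chosen only for illustration.

```python
import math


def error_variance_ratio(r, n_pairs):
    """Ratio v_e / v_y from equation (8): (1 - r^2)(N - 1)/(N - 2)."""
    return (1.0 - r * r) * (n_pairs - 1) / (n_pairs - 2)


# Assumed sample size, for illustration only.
N = 100
for r in (0.25, 0.50, 0.75, 0.90):
    print(r, round(error_variance_ratio(r, N), 3))

# Value of r at which (1 - r^2) alone equals one half.
print(round(math.sqrt(0.5), 4))
```

Because the ratio depends on $r^2$, doubling r from 0.25 to 0.50 removes far less of the variation in y than raising r from 0.50 to 0.75.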

Considerable space might be devoted to further viewpoints on the
interpretation of the correlation coefficient, and the student who is
especially interested in this phase of statistics should refer to the
discussions in the references cited at the end of this chapter. Special notice
should be taken of the discussions by R. A. Fisher (1) of the distribution
of the correlation coefficient; by G. W. Snedecor (4) of the relation
between "common elements" and the correlation coefficient; and by
A. E. Treloar (6) of many phases of the entire subject of correlation.

4. Testing the Significance of the Correlation Coefficient. R. A.
Fisher (1) has shown that for small samples the distribution of r is not
sufficiently close to normality to justify the use of a standard error or a
probable error to test its significance. A more accurate method has
been developed by Fisher, based on the distribution of t. For a
correlation coefficient:

$$t = \frac{r\sqrt{n}}{\sqrt{1 - r^2}} \qquad (12)$$

where n = the number of degrees of freedom available for estimating the
correlation coefficient. The degrees of freedom can always be taken
equal to N − 2, because there is a loss of one degree of freedom for each
statistic calculated from the sample in order to obtain r. These are $\bar{y}$
and $b_{yx}$ (the regression coefficient). Although $b_{yx}$ may not actually have
been calculated, it is involved in the formula of the correlation coefficient
through the sum of products $\Sigma(x - \bar{x})(y - \bar{y})$. This point will be
clear from a consideration of equation (8) which shows that the ratio
$v_e/v_y$ is a function of the correlation coefficient. Now $v_e$ measures the
discrepancies between individual values of y and the corresponding
values of Y estimated from the regression equation. It follows from
this that the correlation coefficient can measure only that portion of the
relation between x and y which is represented by the regression equation.

Since the use of t provides a correct method of testing the significance
of a correlation coefficient regardless of the size of the sample, in general
practice one uses this method for samples of all sizes. For large samples
one might calculate a standard error of r, but even this procedure would
be subject to criticism if the value of the correlation coefficient was
high.

For testing the significance of the difference between two correlation
coefficients t is not suitable, and Fisher (1) has developed an accurate
method which involves transforming the values of r as follows:

$$z' = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\} \qquad (13)$$

The values of $z'$ can be shown to be normally distributed even for small
samples and with a standard deviation given by:

$$\sigma_{z'} = \frac{1}{\sqrt{N - 3}} \qquad (14)$$

To test the significance of the difference between two correlation
coefficients $r_1$ and $r_2$, we proceed as follows:

$$z'_1 = \tfrac{1}{2}\{\log_e(1 + r_1) - \log_e(1 - r_1)\}$$

$$z'_2 = \tfrac{1}{2}\{\log_e(1 + r_2) - \log_e(1 - r_2)\}$$

$$z'_1 - z'_2 = \text{difference}$$

$$\sigma_{z'_1 - z'_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}} \qquad (15)$$

where $N_1$ and $N_2$ are the numbers in the two samples from which $r_1$ and
$r_2$ respectively have been calculated. Finally:

$$t = \frac{z'_1 - z'_2}{\sigma_{z'_1 - z'_2}}$$

The table of t is entered under $N_1 + N_2 - 6$ degrees of freedom.

5. Calculation of the Correlation Coefficient. From the previous
chapter, the methods for calculating the sum of products
$\Sigma(x - \bar{x})(y - \bar{y})$, either directly from paired values or from a
correlation table, will have been noted. It is sufficient therefore to note that
the formulae given in (2) and (3) may be written as follows in convenient
form for calculation.

$$r = \frac{\Sigma(xy)/N - \bar{x}\bar{y}}{\sqrt{(\Sigma(x^2)/N - \bar{x}^2)(\Sigma(y^2)/N - \bar{y}^2)}} \qquad (16)$$

$$r = \frac{\Sigma(xy) - T_x T_y/N}{\sqrt{(\Sigma(x^2) - T_x^2/N)(\Sigma(y^2) - T_y^2/N)}} \qquad (17)$$

$$r = \frac{N\Sigma(xy) - T_x T_y}{\sqrt{(N\Sigma(x^2) - T_x^2)(N\Sigma(y^2) - T_y^2)}} \qquad (18)$$

Formula (17) is the most direct, but (16) and (18) are perhaps better
suited to machine calculation. In (18) there are no divisions in either
the numerator or the denominator; and after all the preliminary
calculations of the values of $\Sigma(xy)$, $T_x$, $T_y$, $\Sigma(x^2)$, and $\Sigma(y^2)$ have been
performed, each of the three factors in the formula may be obtained without
removing any figures from the machine and recording them elsewhere.

The methods of calculating $\Sigma(xy)$, $T_x$, $T_y$, $\Sigma(x^2)$, and $\Sigma(y^2)$ will of
course be the same as described in Chapters II and VI. They may be
calculated either from the correlation table or directly from the paired
values. For N = 50 or less it is probably best to proceed directly, as
setting up the correlation surface is not likely to save any time. When
the numbers are fairly large it is nearly always best to have a correlation
table, as we shall learn later of a test to determine the agreement between
the actual data and the straight line fitted by the regression equation,
and to carry out this test the correlation table must be set up.


Example 12. Direct Calculation of the Correlation Coefficient from Paired
Values. For the sets of paired values given in Table 16 the calculations of $\Sigma(xy)$
were performed and the results given in Section 3 of this chapter. Let us assume
that we wish to calculate the correlation coefficients using formula (17).

Set 1.

$$\Sigma(xy) - T_x T_y/N = 262 - 260 = 2.0$$
$$\Sigma(x^2) - T_x^2/N = 324 - 250 = 74.0$$
$$\Sigma(y^2) - T_y^2/N = 350 - 270.4 = 79.6$$
$$r = \frac{2.0}{\sqrt{74.0 \times 79.6}} = +0.026$$

Set 2.

$$\Sigma(xy) - T_x T_y/N = 335 - 260 = 75.0$$

$\Sigma(x^2) - T_x^2/N$ and $\Sigma(y^2) - T_y^2/N$: same as Set 1.

$$r = \frac{75.0}{\sqrt{74.0 \times 79.6}} = +0.977$$

Set 3.

$$\Sigma(xy) - T_x T_y/N = 186 - 260 = -74.0$$

$\Sigma(x^2) - T_x^2/N$ and $\Sigma(y^2) - T_y^2/N$: same as Set 1.

$$r = \frac{-74.0}{\sqrt{74.0 \times 79.6}} = -0.964$$

To calculate r for Set 2 using formula (18) we would write directly:

$$r = \frac{10 \times 335 - 50 \times 52}{\sqrt{(10 \times 324 - 50^2)(10 \times 350 - 52^2)}}$$

and performing one operation with the machine for each factor we obtain:

$$r = \frac{750}{\sqrt{740 \times 796}}$$

By one more operation we find the denominator and have:

$$r = \frac{750}{767.49} = +0.977$$
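
The arithmetic of formulae (17) and (18) can be sketched in Python from the totals alone. The function names are assumptions made for illustration; the totals are those quoted for Set 2 of this example (N = 10, Σ(xy) = 335, T_x = 50, T_y = 52, Σ(x²) = 324, Σ(y²) = 350).

```python
import math


def r_formula_17(n, sum_xy, t_x, t_y, sum_x2, sum_y2):
    """Correlation coefficient from totals, formula (17)."""
    numerator = sum_xy - t_x * t_y / n
    denominator = math.sqrt((sum_x2 - t_x ** 2 / n) * (sum_y2 - t_y ** 2 / n))
    return numerator / denominator


def r_formula_18(n, sum_xy, t_x, t_y, sum_x2, sum_y2):
    """Correlation coefficient from totals, formula (18): no divisions
    until the final quotient."""
    numerator = n * sum_xy - t_x * t_y
    denominator = math.sqrt((n * sum_x2 - t_x ** 2) * (n * sum_y2 - t_y ** 2))
    return numerator / denominator


# Totals for Set 2 as quoted in the example.
print(round(r_formula_17(10, 335, 50, 52, 324, 350), 3))
print(round(r_formula_18(10, 335, 50, 52, 324, 350), 3))
```

Both forms are algebraically identical; (18) simply multiplies numerator and denominator of (17) by N, which is what makes it convenient on a calculating machine.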


Example 13. Calculation of the Correlation Coefficient from a Correlation Table.
Suppose that we wish to calculate the correlation coefficient for Table 12, Chapter VI.
The first step is to prepare Table 13, which we have already used in Example 11 to
calculate the regression coefficient. From this table we have:

$\Sigma(xy) = 5448$     $T_x = 850$

$\Sigma(x^2) = 3952$    $T_y = 1246$

$\Sigma(y^2) = 8180$

And making use of formula (18) above we calculate:

$$r = \frac{200 \times 5448 - 850 \times 1246}{\sqrt{(200 \times 3952 - 850^2)(200 \times 8180 - 1246^2)}}$$

$$= \frac{30{,}500}{\sqrt{67{,}900 \times 83{,}484}} = \frac{30{,}500}{260.58 \times 288.94}$$

$$= 0.4051$$


Example 14. Tests of Significance. Although the correlation coefficients
calculated in Example 12 were for only 10 pairs of values, the t test will give a reliable
measure of their significance.

The t values are determined as follows:

Set 1. $r = +0.026$

$$t = \frac{0.026\sqrt{8}}{\sqrt{1 - 0.026^2}} = 0.074$$

Set 2. $r = +0.977$

$$t = \frac{0.977\sqrt{8}}{\sqrt{1 - 0.977^2}} = 12.97$$

Set 3. $r = -0.964$

$$t = \frac{0.964\sqrt{8}}{\sqrt{1 - 0.964^2}} = 10.26$$

Turning to Table 94 we note that for n = 8 and P = 0.05 the value of t required
is 2.306. The coefficient 0.026 is therefore quite insignificant, but the other two are
highly significant.
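
The t values of this example can be recomputed with a few lines of Python. This is a sketch only: the function name is an assumption, and the tabular value 2.306 for 8 degrees of freedom at the 5% point is the one quoted in the text. Because the coefficients are used here rounded to three decimals, the printed t values may differ slightly in the last figure from those given above.

```python
import math


def t_for_correlation(r, n_pairs):
    """t of formula (12), with n = N - 2 degrees of freedom."""
    n = n_pairs - 2
    return r * math.sqrt(n) / math.sqrt(1.0 - r * r)


T_5_PERCENT_8_DF = 2.306  # tabular t for n = 8, P = 0.05, as quoted in the text

for label, r in [("Set 1", 0.026), ("Set 2", 0.977), ("Set 3", -0.964)]:
    t = t_for_correlation(r, 10)
    # The sign of t follows the sign of r; significance depends on its size.
    print(label, round(t, 2), abs(t) > T_5_PERCENT_8_DF)
```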

Example 15. The Significance of Difference between Correlation Coefficients.
In a study of the relation between the carotene content of wheat flour and the crumb
color of the bread, Goulden, et al. (2) obtained the following results with 139
varieties.

Carotene in whole wheat with crumb color, $r_1 = -0.4951$
Carotene in flour with crumb color, $r_2 = -0.5791$

The most accurate method for this test is to make use of Fisher's $z'$ transformation.
For the $z'$ test we write:

$$z'_1 = \tfrac{1}{2}\{\log_e(1 + 0.4951) - \log_e(1 - 0.4951)\}$$

$$= \tfrac{1}{2}(\log_e 1.4951 - \log_e 0.5049)$$

$$= \tfrac{1}{2}\log_e\left(\frac{1.4951}{0.5049}\right)$$

$$= \tfrac{1}{2}\log_e 2.9612 = \log_{10} 2.9612 \times 1.1513$$

$$= 0.47147 \times 1.1513 = 0.5428$$

$$z'_2 = \tfrac{1}{2}(\log_e 1.5791 - \log_e 0.4209)$$

$$= \tfrac{1}{2}\log_e 3.7517 = 0.57423 \times 1.1513 = 0.6611$$

$$z'_2 - z'_1 = 0.6611 - 0.5428 = 0.1183$$

$$\sigma_{z'_2 - z'_1} = \sqrt{\frac{1}{136} + \frac{1}{136}} = 0.1213$$

Since the difference is less than its standard deviation it is not significant.

Note that in writing out the formula for $z'$ we pay no attention to the sign of r
as it is the numerical difference between the coefficients that we are testing.
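
The same test can be sketched in Python using natural logarithms directly, which avoids the conversion factor 1.1513 needed with common logarithms. The function names are assumptions for illustration; the coefficients 0.4951 and 0.5791 and the sample size of 139 are those of the example, taken without sign as the note above directs.

```python
import math


def fisher_z(r):
    """Fisher's z' transformation, formula (13)."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))


def z_difference(r1, n1, r2, n2):
    """Difference of the two z' values and its standard deviation, formula (15)."""
    difference = fisher_z(r2) - fisher_z(r1)
    std_dev = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return difference, std_dev


difference, std_dev = z_difference(0.4951, 139, 0.5791, 139)
print(round(difference, 4), round(std_dev, 4))
print("ratio of difference to its standard deviation:", round(difference / std_dev, 2))
```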


6. Exercises.

1. The figures in Table 17 are the physics and English marks¹ for home economics
students in the University of Manitoba. Determine the correlation coefficient for
the relation between the marks in the two subjects. Use the direct method, and
test the significance of the coefficient. r = +0.705.

2. For the same 60 students the correlation coefficient for the marks in art and
clothing is +0.7300, and for art and physics it is +0.6491. Is this a significant
difference?

3. Determine the correlation coefficient for days to head and days to mature of
25 wheat varieties, using the data from Table 14. Find the fiducial limits at the
5% point for this coefficient. r = +0.046.

¹ By courtesy of the Registrar, University of Manitoba.





TABLE 17

Marks in Physics and English of 80 Students in Home Economics
at the University of Manitoba



















CHAPTER VIII


PARTIAL AND MULTIPLE REGRESSION AND 
CORRELATION 

1. The Necessity for Dealing with More Than One Independent 
Variable. In many regression problems the investigator is concerned 
purely with the effect of one variable on another, and this holds true 
regardless of other complicating factors. Suppose tliat a new rapid 
method has been developed for determining the protein content of ^rain 
samples and this method is to be compared with an older and thoroughly 
tested method which is known to give very accurate results. The two 
methods are used on a large series of samples and for the entire series 
the linear regression equation is determined for the regression of protein 
by the old method on protein by the new method. Regardless of how 
these two variables are related, from the practical standpoint of studying 
the efficiency of the new method as a substitute for the old method, it is 
clear that the investigator is concerned purely with the closeness of the 
relationship between the two variables. The new method may not ac- 
tually measure protein content but some other factor that is so closely 
associated with protein content that if we know one we know the other. 
Hence, although the relation between the two variables may be indirect, 
it is the total relation with which we are concerned, as we require merely 
a measure of the accuracy with which we can predict one variable from 
individual measurements of the other variable. In examples of a some- 
what different nature it may be quite misleading to study only the total 
relation between two variables. Suppose that we find a correlation of
+0.60 between the yield of wheat and temperature. Can we conclude
from this result that, if all other conditions remain constant, there will
be an increase in yield with increases in temperature? The answer is
no, because temperature may be associated with some other factor in- 
fluencing yield and this second factor may be the one that is actually 
causing the variations in yield. Suppose that the second factor is rain- 
fall, which is probably the most important of the meteorological factors 
influencing the yield. If rainfall is itself associated with temperature, 
it is clear that there must also be a correlation between yield and tem- 
perature. The latter correlation, however, does not provide us with 
any information of a fundamental nature with respect to the actual 
changes in yield brought about by changes in temperature. What we
require here is a measure of the association between yield and tempera- 
ture when the rainfall remains constant. To the extent that the rela- 
tions between the three variables in a problem of this kind can be ex- 
pressed by linear functions, the measure that we require can be obtained 
by the method of partial regression or partial correlation. Thus the 
partial correlation of yield and temperature will measure the degree of 
covariation for these two variables with a constant rainfall. The partial 
regression coefficient for yield and temperature will give the actual in- 
crease in yield for one unit of increase in temperature when the rainfall 
is constant. If the correlation coefficients for the three variables are 
as follows: 

$r_{yt}$ (yield and temperature) = +0.60
$r_{yr}$ (yield and rainfall) = +0.82
$r_{tr}$ (temperature and rainfall) = +0.78

the partial correlation coefficient for yield and temperature with rainfall
constant may be represented by $r_{yt.r}$, in which the variable placed after
the period is the one that is held constant. Applying the partial
correlation method as illustrated below we find $r_{yt.r}$ = +0.09. Therefore
the actual effect of temperature when rainfall is constant is practically
nil.

It is just as well to emphasize by means of this example that the 
method of partial regression and partial correlation as we are considering 
it here has to do only with the linear relation between the variables. If
the effect of temperature on yield is not the same for a constant low
rainfall as it is for a constant high rainfall, then the linear measures are 
inadequate to express the actual relation. 

2. Derivation of Partial Regression and Partial Correlation Methods.
The method of simple correlations is derived from the regression
equation:

$$y - \bar{y} = b_{yx}(x - \bar{x})$$

where $b_{yx}$ is the regression coefficient. Similarly, when there are three
variables y, x, z, the regression equation is:

$$y - \bar{y} = b_{yx}(x - \bar{x}) + b_{yz}(z - \bar{z})$$

In order to simplify the writing of these equations we use $x_1$ for the
dependent variable and $x_2, x_3, \ldots, x_n$ for the independent variables.
Also $b_{12}$ represents the regression coefficient for $x_1$ on $x_2$, and to
abbreviate further we write $x_1$ for $(x_1 - \bar{x}_1)$ and $x_2$ for $(x_2 - \bar{x}_2)$. Hence the
general regression equation for n variables is:

$$x_1 = b_{12}x_2 + b_{13}x_3 + b_{14}x_4 + \cdots + b_{1n}x_n \qquad (1)$$

The error in estimating $x_1$ from this regression equation will be:

$$(x_1 - b_{12}x_2 - b_{13}x_3 - \cdots - b_{1n}x_n)$$

and it is required to find values of the regression coefficients such that
the sum of the squares of these errors is a minimum. That is, we must
find values of the regression coefficients such that

$$\Sigma(x_1 - b_{12}x_2 - b_{13}x_3 - \cdots - b_{1n}x_n)^2$$

is a minimum. For 4 variables this leads by mathematical treatment
to the following "Normal Equations":

$$\Sigma(x_1x_2) = b_{12}\Sigma(x_2^2) + b_{13}\Sigma(x_2x_3) + b_{14}\Sigma(x_2x_4)$$
$$\Sigma(x_1x_3) = b_{12}\Sigma(x_2x_3) + b_{13}\Sigma(x_3^2) + b_{14}\Sigma(x_3x_4) \qquad (2)$$
$$\Sigma(x_1x_4) = b_{12}\Sigma(x_2x_4) + b_{13}\Sigma(x_3x_4) + b_{14}\Sigma(x_4^2)$$

For a set of n variables there are (n − 1) simultaneous equations for
which the sums of squares and sums of products are known, and by
solving these we arrive at the values for the regression coefficients.

Any partial correlation can then be determined as follows:

$$r_{12.3\ldots n} = \sqrt{b_{12.3\ldots n} \times b_{21.3\ldots n}} \qquad (3)$$

For three variables $x_1$, $x_2$, $x_3$, the normal equations are as follows:

$$\Sigma(x_1x_2) = b_{12}\Sigma(x_2^2) + b_{13}\Sigma(x_2x_3)$$
$$\Sigma(x_1x_3) = b_{12}\Sigma(x_2x_3) + b_{13}\Sigma(x_3^2)$$

from which it can be proved that

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$

Similarly

$$r_{13.2} = \frac{r_{13} - r_{12}r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}} \qquad (4)$$

and

$$r_{23.1} = \frac{r_{23} - r_{12}r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}$$
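
The three-variable formulae lend themselves to a short Python sketch. The function name is an assumption for illustration; the three simple coefficients used are the values for variables 1, 2, and 3 as read from Table 18 in Example 16 of this chapter.

```python
import math


def partial_r_first_order(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b with c held constant:
    (r_ab - r_ac * r_bc) / sqrt((1 - r_ac^2)(1 - r_bc^2))."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1.0 - r_ac ** 2) * (1.0 - r_bc ** 2))


# Simple coefficients for variables 1, 2, 3 as read from Table 18.
r12, r13, r23 = -0.4589, -0.5612, 0.3114

print("r12.3 =", round(partial_r_first_order(r12, r13, r23), 4))
print("r13.2 =", round(partial_r_first_order(r13, r12, r23), 4))
print("r23.1 =", round(partial_r_first_order(r23, r12, r13), 4))
```

A single function serves for all three partials, since each formula is the same expression with the roles of the variables interchanged.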
This is the most rapid method of obtaining the partials for only
three variables. For four or more variables it is best to make use of the
fact that the normal equations can be written as follows, taking as an
example the equations for five variables:

$$r_{12} = \beta_{12} + \beta_{13}r_{23} + \beta_{14}r_{24} + \beta_{15}r_{25}$$
$$r_{13} = \beta_{12}r_{23} + \beta_{13} + \beta_{14}r_{34} + \beta_{15}r_{35}$$
$$r_{14} = \beta_{12}r_{24} + \beta_{13}r_{34} + \beta_{14} + \beta_{15}r_{45}$$
$$r_{15} = \beta_{12}r_{25} + \beta_{13}r_{35} + \beta_{14}r_{45} + \beta_{15}$$

The correlation coefficients are the known values, and the beta ($\beta$) values
the unknown. The latter can be used as illustrated below to compute
the partial correlation coefficients.

Tabular methods of solving these equations for the beta values have
been devised which reduce the labor to a minimum. The beta values
are defined by:

$$\beta_{12.3\ldots n} = b_{12.3\ldots n}\frac{s_2}{s_1}, \qquad \beta_{21.3\ldots n} = b_{21.3\ldots n}\frac{s_1}{s_2}$$

where $s_1$ and $s_2$ are the standard deviations of $x_1$ and $x_2$. Hence, on
referring to equation (3) above, we find that:

$$\beta_{12.3\ldots n} \times \beta_{21.3\ldots n} = b_{12.3\ldots n} \times b_{21.3\ldots n}$$

And hence:

$$r_{12.3\ldots n} = \sqrt{\beta_{12.3\ldots n} \times \beta_{21.3\ldots n}} \qquad (9)$$

In order to obtain all the beta values, it is necessary to rewrite the
normal equations in different ways and solve. For example, in order to
obtain $\beta_{21}$ the equations for five variables must be written:

$$r_{21} = \beta_{21} + \beta_{23}r_{13} + \beta_{24}r_{14} + \beta_{25}r_{15}$$
$$r_{23} = \beta_{21}r_{13} + \beta_{23} + \beta_{24}r_{34} + \beta_{25}r_{35}$$
$$r_{24} = \beta_{21}r_{14} + \beta_{23}r_{34} + \beta_{24} + \beta_{25}r_{45}$$
$$r_{25} = \beta_{21}r_{15} + \beta_{23}r_{35} + \beta_{24}r_{45} + \beta_{25}$$
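
Where a computer is available, the normal equations in the beta form can be solved directly instead of by the tabular method. The Python sketch below is one possible arrangement under stated assumptions, not the procedure of the text: `solve_linear`, `beta_values`, and `partial_r` are hypothetical names, Gaussian elimination is substituted for the tabular solution, and the sign of the partial is taken from the beta value. It is shown for variables 1, 2, and 3 of Table 18 but accepts a correlation matrix of any size.

```python
import math


def solve_linear(a, b):
    """Solve the system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def beta_values(corr, dep):
    """Beta values of variable `dep` on every other variable (0-based indices)."""
    others = [k for k in range(len(corr)) if k != dep]
    a = [[corr[i][j] for j in others] for i in others]
    b = [corr[dep][i] for i in others]
    return dict(zip(others, solve_linear(a, b)))


def partial_r(corr, i, j):
    """Partial correlation of i and j, all remaining variables held constant."""
    b_ij = beta_values(corr, i)[j]
    b_ji = beta_values(corr, j)[i]
    return math.copysign(math.sqrt(max(b_ij * b_ji, 0.0)), b_ij)


# Variables 1, 2, 3 as read from Table 18 (indices 0, 1, 2 here).
r12, r13, r23 = -0.4589, -0.5612, 0.3114
corr = [
    [1.0, r12, r13],
    [r12, 1.0, r23],
    [r13, r23, 1.0],
]
print(round(partial_r(corr, 0, 1), 4))  # r12.3 obtained from the beta values
```

For three variables the result should agree with the direct formula for $r_{12.3}$, which gives a convenient check on the arrangement of the equations.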
Correlation coefficients are often referred to as coefficients of the
pth order, where p is the number of variables held constant. Thus the 
simple coefficient ri2 is of zero order, and the partial coefficient ri2.345 is 
of the third order. 

3. Example 16. Calculation of Partial Regression and Partial Correlation
Coefficients. The simple correlation coefficients in Table 18 were obtained in a study
(2) of the effect of the physical characteristics of wheat on the yield and quality of
flour.

TABLE 18

Simple Correlation Coefficients for the
Relations between Six Variables

         1          2          3          4          5
6     0.6412    −0.3190    −0.4462    −0.3511    −0.3092
5    −0.3123     0.2861     0.1467     0.1882
4    −0.3947     0.0429    −0.0655
3    −0.5612     0.3114
2    −0.4589

where 1 = yield of straight grade flour.    4 = per cent immaturity.
      2 = per cent bran frost.              5 = per cent green kernels.
      3 = per cent heavy frost.             6 = weight per bushel.

In order to use the above method to determine the effects on yield of flour of any
one of the forms of damage or of weight per bushel, it is necessary to determine the
partial correlation coefficients:

$r_{12.3456}$, $r_{13.2456}$, $r_{14.2356}$, $r_{15.2346}$, $r_{16.2345}$

For which we will require

$\beta_{12} \cdot \beta_{21}$, $\beta_{13} \cdot \beta_{31}$, $\beta_{14} \cdot \beta_{41}$, $\beta_{15} \cdot \beta_{51}$, $\beta_{16} \cdot \beta_{61}$

We solve for these by the method illustrated in Table 19. It is a tabular method of
solving the simultaneous equations and is best understood from a study of the table.

Note that the calculations of Table 19 give $\beta_{12}$, $\beta_{13}$, $\beta_{14}$, $\beta_{15}$, and $\beta_{16}$, and that in
order to obtain the other beta values the simple correlation coefficients must be
rearranged and the calculations repeated. The rearrangement in the order 6, 5, 4, 3,
1, 2 will give $\beta_{21}$, $\beta_{23}$, $\beta_{24}$, $\beta_{25}$, $\beta_{26}$. The next logical rearrangement is 6, 5, 4, 1, 2, 3,
giving $\beta_{32}$, $\beta_{31}$, $\beta_{34}$, $\beta_{35}$, $\beta_{36}$. We continue rearranging the simple correlation coefficients
until all the beta values have been calculated. Then they are put together in a table
and we select those necessary in order to give the required partials.

The following instructions will be found useful in carrying through the tabular
method of solving the equations.

(1) Rule a sheet of paper as in Table 19.

(2) Enter all the correlation coefficients as indicated in lines 1, 3, 7, 12, and 18.

(3) Sum the correlation coefficients to obtain values given in column S.
Note that the first sum, line 1, is $r_{61} + r_{62} + r_{63} + r_{64} + r_{65} + r_{66}$; the sum in line
3 is $r_{51} + r_{52} + r_{53} + r_{54} + r_{55} + r_{56}$; the sum in line 7 is
$r_{41} + r_{42} + r_{43} + r_{44} + r_{45} + r_{46}$, etc.

The S column provides a check for all the preceding work. The values 1.0662
and −1.1789 must check with the sum of the values in lines 5 and 6 respectively.
There are similar checks in the S column of lines 10 and 11, 16 and 17, and 23
and 24. All these checks are approximate, and therefore the values obtained in
the check column will not agree with those calculated from the body of the table
to the last decimal figure.

(4) The last value calculated in line 24 is $\beta_{12}$ with its sign changed. It is
written below in line 1 of the reverse with the correct sign, and also in column 2,
line 1 of the reverse. The remaining values in column 1 come from lines 17, 11,
6, and 2 of the same column but with their signs reversed.

In column 2 the values are:

$\beta_{12}$ × (17.2)
$\beta_{12}$ × (11.2)
$\beta_{12}$ × (6.2)
$\beta_{12}$ × (2.2)

In line 2 (reverse) add from right to left and obtain $\beta_{13}$, then the remaining
values in column 3 are:

$\beta_{13}$ × (11.3)
$\beta_{13}$ × (6.3)
$\beta_{13}$ × (2.3)

In line 3 (reverse) add from right to left and obtain $\beta_{14}$, then the remaining
values in column 4 are:

$\beta_{14}$ × (6.4)
$\beta_{14}$ × (2.4)

In line 4 (reverse) add from right to left and obtain $\beta_{15}$, then the remaining
value in column 5 is:

$\beta_{15}$ × (2.5)

In line 5 (reverse) add from right to left and obtain $\beta_{16}$.

After completing the calculations as in Table 19 the correlation coefficients are
arranged in the order 6, 5, 4, 3, 1, 2, in a new table and the calculations carried out as
before. For 6 variables there will be 6 tables to calculate, each table giving 5 of the
total of 30 beta values. When the latter have all been calculated they can be
tabulated, and all that remains is to work out the partials. It is convenient to make a
table such as Table 20 for entering the beta values and the corresponding partial
correlation coefficients.
TABLE 19

[The numerical entries of Table 19 (columns 1 to 6 and S, lines 1 to 24, followed by
the reverse solution) are not legible in the source. The instruction column of the
table reads as follows.]

Line    Instruction
 1      Enter r61, r62, r63, r64, r65, r66
 2      Divide line 1 by 1.6 and change signs
 3      Enter r51, r52, r53, r54, r55
 4      Multiply line 1 by 2.5
 5      Add lines 3 and 4
 6      Divide line 5 by 5.5 and change signs
 7      Enter r41, r42, r43, r44
 8      Multiply line 1 by 2.4
 9      Multiply line 5 by 6.4
10      Add lines 7 to 9
11      Divide line 10 by 10.4 and change signs
12      Enter r31, r32, r33
13      Multiply line 1 by 2.3
14      Multiply line 5 by 6.3
15      Multiply line 10 by 11.3
16      Add lines 12 to 15
17      Divide line 16 by 16.3 and change signs
18      Enter r21, r22
19      Multiply line 1 by 2.2
20      Multiply line 5 by 6.2
21      Multiply line 10 by 11.2
22      Multiply line 16 by 17.2
23      Add lines 18 to 22
24      Divide line 23 by 23.2 and change signs

In the instructions 2.5 represents the value 0.3092 in line 2, column 5. Similarly, 6.4 represents the value in line 6, column 4, and so forth.
TABLE 20

Beta Values and Partial Correlation Coefficients

Subscript    β    Subscript    β    r Subscript    r
   12                 21              12.3456
   13                 31              13.2456
   14                 41              14.2356
   15                 51              15.2346
   16                 61              16.2345

[The numerical entries of Table 20 are not legible in the source.]


4. Tests of Significance. The t test is applicable to partial correlations
in the same way as to simple correlations but the degrees of freedom
are different. If p is the number of variables held constant, for
partial correlation coefficients we have

$$n = N - p - 2 \qquad (10)$$
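
Formula (12) of the previous chapter, $t = r\sqrt{n}/\sqrt{1 - r^2}$, can be solved for r to give the smallest coefficient that reaches a given t, namely $r = t/\sqrt{t^2 + n}$. The Python sketch below applies this with the degrees of freedom of equation (10). The function names are assumptions for illustration, and the value t = 2.042 for 30 degrees of freedom at the 5% point is taken from standard tables of t rather than from this text.

```python
import math


def degrees_of_freedom(n_pairs, p):
    """Degrees of freedom for a partial correlation of order p, equation (10)."""
    return n_pairs - p - 2


def minimum_significant_r(t_critical, n):
    """Smallest |r| giving t >= t_critical, from t = r*sqrt(n)/sqrt(1 - r^2)."""
    return t_critical / math.sqrt(t_critical ** 2 + n)


# A fourth order partial from N = 36 observations leaves 36 - 4 - 2 = 30 d.f.
n = degrees_of_freedom(36, 4)
T_5_PERCENT_30_DF = 2.042  # assumed tabular value for n = 30, P = 0.05
print(n, round(minimum_significant_r(T_5_PERCENT_30_DF, n), 4))
```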

5. Multiple Correlation. In our example, if we consider not the
separate but the total effect of weight per bushel and the different forms
of damage on the yield of flour, the problem is one of multiple correlation.
Since all these variables have some effect on flour yield the more
information we have on them the more closely we can predict the flour yield
of a particular sample of wheat.

A simple correlation coefficient measures the relation between a
dependent and one independent variable. A multiple correlation
coefficient measures the combined relation between a dependent variable and
a series of independent variables.

Equation (1):

$$x_1 = b_{12}x_2 + b_{13}x_3 + b_{14}x_4 + \cdots + b_{1n}x_n$$

is in reality a multiple regression equation as it may be used to predict
values of $x_1$ from the known values of $x_2, x_3, x_4, \ldots, x_n$.

6. Calculation of Multiple Correlation Coefficients. Two methods
are in use for the calculation of the multiple correlation coefficient.
These arise from the two equations (11) and (12) below:

$$1 - R^2_{1.23\ldots n} = (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23})(1 - r^2_{15.234}) \cdots (1 - r^2_{1n.23\ldots(n-1)}) \qquad (11)$$

$$R^2_{1.23\ldots n} = \beta_{12}r_{12} + \beta_{13}r_{13} + \cdots + \beta_{1n}r_{1n} \qquad (12)$$
Method (11) can be used only when all the partial correlation
coefficients of the first, second, third, to the (n − 2) order are known, and
hence it is impossible when the partials have been obtained by solving
the normal equations. It is very useful, however, when only three
variables are being studied. For three variables we have:

$$1 - R^2_{1.23} = (1 - r^2_{12})(1 - r^2_{13.2})$$

Method (12) is directly applicable when the partial correlation
coefficients have been obtained by solving the normal equations for the beta
values.
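
For three variables the two methods can be compared in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the beta values are written out from the two normal equations for three variables, and the simple coefficients are those for variables 1, 2, and 3 as read from Table 18.

```python
import math


def r_squared_method_11(r12, r13, r23):
    """R^2 from equation (11): 1 - R^2 = (1 - r12^2)(1 - r13.2^2)."""
    r13_2 = (r13 - r12 * r23) / math.sqrt((1.0 - r12 ** 2) * (1.0 - r23 ** 2))
    return 1.0 - (1.0 - r12 ** 2) * (1.0 - r13_2 ** 2)


def r_squared_method_12(r12, r13, r23):
    """R^2 from equation (12): beta12 * r12 + beta13 * r13."""
    # Solution of the two normal equations for three variables.
    beta12 = (r12 - r13 * r23) / (1.0 - r23 ** 2)
    beta13 = (r13 - r12 * r23) / (1.0 - r23 ** 2)
    return beta12 * r12 + beta13 * r13


# Simple coefficients for variables 1, 2, 3 as read from Table 18.
r12, r13, r23 = -0.4589, -0.5612, 0.3114

print(round(math.sqrt(r_squared_method_11(r12, r13, r23)), 4))
print(round(math.sqrt(r_squared_method_12(r12, r13, r23)), 4))
```

The two results should agree, and the multiple coefficient should be no smaller than the larger of the two simple coefficients taken without sign.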

7. Testing the Significance of Multiple Correlations. It should be
noted that, in equation (11) above, any one of the factors such as
$1 - r^2_{13.2}$ cannot be greater than unity, since the square of a correlation
coefficient cannot be less than zero. Hence if we compare
$(1 - R^2_{1.23\ldots n})$ and $(1 - r^2_{12})$:

$$1 - R^2_{1.23\ldots n} < 1 - r^2_{12}$$

giving

$$R^2_{1.23\ldots n} - 1 > r^2_{12} - 1$$

or

$$R^2_{1.23\ldots n} > r^2_{12}$$

Similarly for any other factor on the right of the equation; hence:

$$R^2_{1.23\ldots n} > r^2_{12},\ r^2_{13.2},\ \ldots,\ r^2_{1n.23\ldots(n-1)}$$

The multiple correlation coefficient is greater therefore than any of
the constituent coefficients; and its minimum value is zero and not
−1, as is the case with a simple or partial coefficient. For this reason
a special table must be used for testing the significance of multiple
correlations.¹ The calculation of t values, standard errors, or probable
errors will give entirely erroneous results. Two tables that may be used
are in the references given below.

8. Exercises. 

1. Complete the calculation of the partial correlation coefficients begun in
Example 16. The following values will assist in checking the work: 

rjs.MM " —0.3177 
rii'iitt " 0.3367 

ru-iMf *" —0.0303 
ns ISM ■■ —0.1373 

¹ A test is described in Chapter XIII that is based on the analysis of variance.

2. If N is 36, determine the minimum value of a fourth order correlation coefficient
that is significant. Put r in terms of t and the number of degrees of freedom.

The value obtained should be 0.3493.

3. Calculate the multiple correlation coefficient $R_{1\cdot 23456}$ for the same data as in
Example 16, and determine its significance. R = 0.7036.

4. Write the simultaneous equations for three variables in the same form as (6)
above. Then prove:

$$r_{12\cdot 3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r^2_{13})(1 - r^2_{23})}}$$
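The relation asked for in Exercise 2 can be checked numerically. The sketch below assumes the usual relation t = r√(df)/√(1 − r²), inverted to give r in terms of t, and assumes that a partial coefficient of a given order leaves N − 2 − order degrees of freedom. The 5% point of t for 30 degrees of freedom, 2.042, is taken from a standard t table; the function names are illustrative.

```python
import math

def degrees_of_freedom(n_obs, order):
    """Degrees of freedom for a partial correlation of the given order (assumed rule)."""
    return n_obs - 2 - order

def r_from_t(t, df):
    """Correlation coefficient corresponding to a given t and degrees of freedom."""
    return t / math.sqrt(t * t + df)

T_5_PERCENT_30_DF = 2.042  # tabulated two-tailed 5% point of t for 30 df

df = degrees_of_freedom(36, 4)
r_min = r_from_t(T_5_PERCENT_30_DF, df)
print(df, round(r_min, 4))
```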



CHAPTER IX 


THE χ² (CHI-SQUARE) TEST

1. Data Classified in Two Ways. On reviewing the types of prob- 
lems that have been presented in the previous chapters, it will be recalled 
that they have dealt with data of two kinds. In the first place we 
studied an example in which an operator attempted to classify grain 
samples according to variety. The samples were placed either rightly 
or wrongly, and there was no intermediate condition. The power of
the operator to differentiate the samples was therefore measured in
terms of the number of samples placed correctly. With a little thought
it will be clear that a great many problems must occur in which the data
are of this type. Thus, in describing the health of a population, an 
obvious criterion will be the proportion of the population that are ill, or 
perhaps the percentage dying within the year. Again, a set of varieties 
of a cereal crop may be differentiated by the number of seeds that are 
viable, and so forth. In further examples the data were of a different 
type as in the case of yields of wheat plots, weights and heights of men, 
and degree of infection. We may be reminded, by these remarks, of 
the classification of variables as continuous and discontinuous, wherein 
the distinction between the two is fairly clear cut. Will data arising 
from discontinuous variables always fall into the first class mentioned 
above, and data from continuous variables into the second class? The 
answer is that they will not be so easily separated in this way, as we can
easily imagine a situation in which data for a continuous variable may 
be treated by the two methods. We may take as an example a com- 
parison of the yields of two varieties of wheat. In the first place, if 
there are a sufficient number of plots we may compare the two varieties 
according to the number of plots that fall into an arbitrarily determined 
low-yielding class, or an arbitrarily determined high-yielding class; or
better still we may compare the numbers of plots falling into both 
classes. In the second case we may simply compare the average yields 
of the two varieties on all the plots. Which method shall we use? This 
question is also very easily answered, as it will be clear that the first 
method applied to an example of this kind is cumbersome and unwieldy, 
and will be used only when the numbers are fairly large and the method
of classifying the plots according to yield is only approximate. For
example, in a comparison of two varieties as grown by farmers it may 
be impossible to obtain accurate yields, but it may be possible to classify 
the fields quite accurately into the groups low-yielding and high-yielding. 
Then, with a fairly large number of fields to work with, a good com- 
parison of the varieties may be made simply by determining the number 
in each group. For discontinuous variables, on the other hand, com- 
parisons will usually be found to be most conveniently made by the first 
method, and this is particularly true if the character with which we are 
concerned is definitely not measurable in a quantitative manner. Thus 
people may be classified only as dead or alive; and although there may be 
a theoretical situation existing for a short period in which this classi- 
fication is uncertain, it is certainly of no practical significance in describ- 
ing what has happened to two populations as a result, say, of their 
having received two different treatments. 

In this chapter we are concerned mainly with methods of applying
tests of significance in examples where the data are in the form of
frequencies as in the first class mentioned above. Snedecor (4) has very
aptly used the term enumeration data to describe data of this type.

2. Tests of Goodness of Fit. In many problems the test that is
required is a comparison of a set of actual frequencies with a
corresponding set of theoretical frequencies. Thus in experiments in
genetics an F₂ population may be classified into two groups, as in a
wheat experiment in which the F₂ population of 131 plants is classified
as 106 that are resistant to rust and 25 that are susceptible. The
predominance of resistant plants can be explained by the well-known
theory of dominance of the genes for rust resistance coupled with the
supposition that rust reaction is determined by only one pair of genes,
one parent having contributed the gene for rust resistance and the other
parent the gene for susceptibility. This is plainly an hypothesis which
gives a general explanation of the results, and as such may be subject to
testing in the same manner as the familiar null hypothesis of Chapter I.
The procedure of this test follows from the following considerations.

In a population for which the hypothesis is true, if a large number of
samples of 131 plants each are taken, these will be found to vary around
a mean value for the frequencies of resistant and susceptible plants
which will be directly calculable from the hypothesis. Thus in the
present example it is easily demonstrated that the mean of such a
population will be 98.25 resistant plants and 32.75 susceptible plants.
In taking samples from this population, it is to be expected that owing
to random variation some of these samples will exhibit quite wide
variations from the mean of the population, but a large proportion of
them will, of course, be fairly close to the mean of the population. If
we knew the theoretical distribution of such samples around the mean, we
could calculate for samples the same size as ours the numbers of
resistant and susceptible plants which would occur as the result of
random variations in only 5% of the trials. This would establish for us
the 5% level of significance; that is, if our actual sample fell outside
of the range of this 5% level we would say that the data did not
substantiate the hypothesis, and in fact that they are fairly convincing
evidence that the hypothesis is not true. If our sample fell well within
the 5% level we would then say that there was good agreement between the
data and the hypothesis, but the hypothesis would not necessarily be
proved. Now the distribution of the samples can be calculated directly
by methods similar to those used in Chapter I, and we shall see in
Chapter X that if the sample is small it may be advantageous to proceed
on this basis. However, for general application a much easier method is
available. This method involves the calculation for the data of the
sample of a statistic known as χ² (chi square), which is distributed in
a known manner depending on the number of degrees of freedom available
for its estimation. For the general case χ² is given by:

$$\chi^2 = \sum \frac{(a - t)^2}{t} \qquad (1)$$

where a represents the actual frequencies and t the corresponding
theoretical frequencies. Thus in the present example the actual
frequencies are 106 and 25, and the corresponding theoretical
frequencies are 98.25 and 32.75. The two values of a − t are therefore
both equal to 7.75 in absolute value, and χ² = 7.75²/98.25 + 7.75²/32.75
= 2.445.¹ The number of degrees of freedom available for the estimation
of χ² is 1. In this respect the problem is similar to the t test for the
differences between paired values. Here we have two pairs of differences
as represented by the two values of a − t, and consequently there is
only one degree of freedom. Another concept of the degrees of freedom
arises from the fact that there are only two classes, resistant and
susceptible. The total number in the sample being fixed, if the number
in any one class is fixed the number in the other class must also be
fixed. There is therefore only one class which can be arbitrarily
assigned a given frequency, and this means that there is only one degree
of freedom.

¹ For simple ratios a direct formula suggested by F. R. Immer for
calculating χ² may be used. This formula is:

$$\chi^2 = \frac{(a_1 - x a_2)^2}{xN}$$

where the theoretical ratio is x : 1, $a_1$ is the actual frequency
corresponding to x and $a_2$ is the actual frequency corresponding to 1.
N is the total frequency.

The next step in the test is to examine the tables that give the
distribution of χ² and find the value at the 5% level for one degree of
freedom. We enter Table 95 and find that the value of χ² at the
5% point is 3.84. Since our value of 2.445 is smaller than this, our
conclusion is that the data do not disagree significantly with the
hypothesis. Of course, we can if necessary go further and determine
approximately in what proportion of cases such a result as ours would be
obtained. The χ² value of 2.445 falls between the two values of χ² that
correspond to the 10 and 20% levels of P. By interpolation our value is
found to correspond to the 13% point, and consequently we can say that a
sample showing a deviation from the theoretical as great as or greater
than the one observed would be expected to occur in 13% of the trials.
The observed deviation is therefore not very important and does not in
any sense disprove the hypothesis.
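The arithmetic of the rust example can be reproduced as a minimal sketch. The function names are illustrative. The probability for one degree of freedom is computed here from the normal integral (an assumption of this sketch, not the tabular interpolation used in the text), so it need not agree exactly with the interpolated figure of 13%.

```python
import math

def chi_square(actual, theoretical):
    """Sum of (a - t)^2 / t over all classes."""
    return sum((a - t) ** 2 / t for a, t in zip(actual, theoretical))

def p_value_one_df(chi2):
    """Probability of a chi-square as large or larger, for 1 degree of freedom.

    With 1 df, chi-square is the square of a standard normal deviate.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))

def immer_chi_square(a1, a2, x):
    """Direct formula for a theoretical ratio x : 1, as in the footnote."""
    n = a1 + a2
    return (a1 - x * a2) ** 2 / (x * n)

actual = [106, 25]
n = sum(actual)
theoretical = [n * 3 / 4, n * 1 / 4]      # 98.25 and 32.75 on a 3 : 1 ratio

chi2 = chi_square(actual, theoretical)
chi2_direct = immer_chi_square(106, 25, 3)
p = p_value_one_df(chi2)
print(round(chi2, 3), round(chi2_direct, 3), round(p, 2))
```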

It should be noted at this point that the possible deviations from the
theoretical may occur in both directions, and that in the test of
significance both these possibilities have been taken into account.
Since there is very often a good deal of confusion on this point, it may
be just as well to emphasize here that it is absolutely necessary, in
testing the hypothesis set up, to take into account possible deviations
in both directions. Our hypothesis involves picturing a population
deviating about a mean of 98.25 resistant to 32.75 susceptible plants.
According to the theory, deviations of 7.75 in either direction are
equally likely, and in our sample the deviation happened to be positive
for the resistant group and negative for the susceptible group. If we
should determine the proportion of the trials in which a positive
deviation as great as or greater than the one observed would occur, it
is clear that this proportion would be exactly half of the proportion
determined above, or about 6.5%. But this would not be a test of
agreement with the hypothesis, any more than it would be to determine
the proportion of the trials, say, in which a deviation of +7.75 to
+8.00 would occur. The proportion would be very small, but it would in
no way indicate disagreement with the hypothesis. Another way to
consider this problem is to examine the possible consequences of
accepting as a test of significance the 5% level, taking into account
positive deviations only. On a large series of samples the investigator
would expect to classify 5% of the samples as giving a significant
disagreement with the hypothesis, even when the hypothesis is true. If
positive deviations only are considered he would classify only 2½% of
the samples in this way, and consequently would not be setting up the
level of significance at the 5% but at the 2½% point. In certain cases,
as we shall see later in the next chapter, it is legitimate as a test of
significance to take into consideration the deviations at one end of the
distribution only; but these are special cases and not comparable to the
example given above.

Example 17. In a cross of two wheat varieties, Reward and Hope, the following
results were obtained for the frequencies of resistant, semi-resistant, and susceptible
plants in the F₂ generation.

    Resistant          111
    Semi-resistant     232
    Susceptible       1181

The theoretical frequencies according to two hypotheses are as follows:

                       Single Factor     Two Complementary Factors
                       Difference        and an Inhibitor
    Resistant              381                  119
    Semi-resistant         762                  238
    Susceptible            381                 1167

If we wish to test the two hypotheses by comparing the actual with the expected
frequencies in each case, the work may be set up and carried through as follows:

         Single Factor Hypothesis           Complementary and Inhibiting
                                            Factor Hypothesis
    Actual   Theoretical   (a − t)²/t       Actual   Theoretical   (a − t)²/t
      111        381          191.3           111        119         0.5378
      232        762          368.6           232        238         0.1513
     1181        381         1679.8          1181       1167         0.1680

    χ² = 2239.7   n = 2   P = 0.0000        χ² = 0.857   n = 2   P = 0.65

We have two degrees of freedom in each case, and we find for the first case that such
a large value of χ² is not given in the table. The largest value under n = 2 is 9.21,
which corresponds to a P of 0.01. We can conclude, therefore, that the probability
of obtaining deviations, due to chance variation, as great as or greater than those
observed is too remote to be considered. In the second case, χ² = 0.857 and this
corresponds approximately to P = 0.65. The fit here is very good since deviations
as great as or greater than those observed may be expected in at least 50% of the
cases. The final conclusion is that the single factor hypothesis is quite inadequate
to explain the type of segregation observed, but there is good evidence to support the
second hypothesis based on a pair of complementary factors and an inhibiting factor.

Example 18. In an assumed cross between parents of the constitution BBcc and
bbCC, the F₂ population is classified as follows:

      BC      Bc      bC      bc     Total
     1260     625     610      5     2500

According to a theoretical 9 : 3 : 3 : 1 ratio, the theoretical frequencies would be:

      BC      Bc      bC      bc     Total
     1406     469     469     156    2500

The actual results differ very widely from the expected as indicated by calculating
χ². In this case we find χ² = 255.60 and referring to Table 95 and entering at
n = 3 we note that 11.34 is the highest value given. It is clear that the fit is very
poor; so we proceed to analyze the data for the source of the disturbance, and develop
a hypothesis more in accordance with the facts. In the first place the assumption is
made when the 9 : 3 : 3 : 1 ratio is built up that the ratio of B to b is 3 : 1, and that of
C to c is also 3 : 1. A discrepancy in either one of these ratios will result in a poor fit
to the 9 : 3 : 3 : 1 for the whole set. Consequently we set up the two actual ratios
and calculate χ² for each.

        B        b                          C        c
      1885      615                       1870      630

    χ² = (1885 − 3 × 615)²/(3 × 2500)     χ² = (1870 − 3 × 630)²/(3 × 2500)
       = 0.2133                              = 0.0533

Now χ² values may be added together or separated into components. In this case we
can add the two values, obtaining a new χ² of 0.2666. Similarly we add the
degrees of freedom, obtaining n = 2. On looking up the tables we find that the P
value is between 0.95 and 0.50 but closer to the latter, hence the fit is good and the
discrepancy of the actual from the theoretical 9 : 3 : 3 : 1 ratio is not due to the
segregation of the individual pairs of factors, but to the behavior of the factor pairs in
relation to each other. In other words, there must be a tendency for the factors to be linked
in inheritance. It is a common procedure in such cases to calculate the linkage
intensity. An approved method (1) for examples of this type gives 9% of crossing
over, and on that basis we can determine a new set of expected frequencies. These are
set up below with the actual frequencies and another value of χ² determined.

    Classes    Actual         Theoretical     (a − t)²/t
               Frequencies    Frequencies
      BC          1260           1255           0.0199
      Bc           625            620           0.0403
      bC           610            620           0.1613
      bc             5              5           0.0000

                                           χ² = 0.2215

The theoretical frequencies in this table have been calculated on the basis of 9%
crossing over, a value which was determined from the sample itself. Therefore, we
lose one degree of freedom and must enter the table under n = 2. In this case we
find P = approximately 0.90. There is a very close agreement between the two sets
of frequencies, but it would not be correct to consider this a very satisfactory fit.
Such close agreement could only occur by chance on the basis of the hypothesis
being tested in 10% of the cases. However, the agreement is not sufficiently close to
prove that the original data were selected to give a good fit. If we had obtained a
P of 0.95, it would have been worth while investigating the data to determine the
reason for the very unusual agreement.
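The calculations of Examples 17 and 18 can be retraced with a short sketch. Several points are assumptions of this illustration rather than statements of the text: the probability for two degrees of freedom is taken as exp(−χ²/2); the expected classes under linkage use the common repulsion-phase expressions (2 + p²)/4, (1 − p²)/4, (1 − p²)/4 and p²/4 for a crossover fraction p; and theoretical frequencies are left unrounded, so the results differ slightly from the hand-worked figures.

```python
import math

def chi_square(actual, theoretical):
    """Sum of (a - t)^2 / t over all classes."""
    return sum((a - t) ** 2 / t for a, t in zip(actual, theoretical))

def p_value_two_df(chi2):
    """Probability of a chi-square as large or larger, for 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)

def ratio_chi_square(a1, a2, x):
    """Direct formula for a theoretical ratio x : 1."""
    return (a1 - x * a2) ** 2 / (x * (a1 + a2))

def repulsion_expected(p, n):
    """Expected BC, Bc, bC, bc frequencies for crossover fraction p (assumed model)."""
    return [n * (2 + p * p) / 4, n * (1 - p * p) / 4,
            n * (1 - p * p) / 4, n * p * p / 4]

# Example 17: two hypotheses for the same three classes.
actual17 = [111, 232, 1181]
chi2_single = chi_square(actual17, [381, 762, 381])
chi2_compl = chi_square(actual17, [119, 238, 1167])
p_compl = p_value_two_df(chi2_compl)

# Example 18: overall 9:3:3:1 fit, the two single-factor ratios, and linkage.
actual18 = [1260, 625, 610, 5]
total = sum(actual18)
chi2_9331 = chi_square(actual18, [total * w / 16 for w in (9, 3, 3, 1)])
chi2_B = ratio_chi_square(1260 + 625, 610 + 5, 3)   # B : b = 1885 : 615
chi2_C = ratio_chi_square(1260 + 610, 625 + 5, 3)   # C : c = 1870 : 630
linkage = repulsion_expected(0.09, total)
chi2_linkage = chi_square(actual18, linkage)

print(round(chi2_single, 1), round(chi2_compl, 3), round(p_compl, 2))
print(round(chi2_9331, 1), round(chi2_B + chi2_C, 4), round(chi2_linkage, 3))
```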

Example 19. The goodness of fit test may be useful in determining the
agreement between actual and theoretical normal frequency distributions. In Chapter III,
Example 1, we calculated the normal frequencies corresponding to the actual
frequencies for the transparencies of 400 red blood cells. In Table 21, these two
distributions are repeated, and the third column gives the calculation of χ².

TABLE 21

Actual and Normal Frequencies for Transparencies
of 400 Red Blood Cells, and Calculation of χ²

    Actual    Theoretical    (a − t)²/t
              Normal
       4         4.64          0.0883
      11         7.92          1.1978
      17        16.84          0.0015
      29        30.28          0.0541
      43        44.76          0.0692
      56        59.16          0.1688
      58        64.96          0.7457
      63        60.40          0.1119
      61        47.56          3.7980
      25        31.16          1.2178
      20        18.24          0.1698
       9         8.80          0.0045
       4         5.28          0.3103

     400       400.00     χ² = 7.9377

In connection with a test of this kind, two important points should be noted.

(1) At the tails of the distribution the theoretical frequencies and corresponding
actual frequencies are grouped. The object is to avoid very small theoretical values
which, if present, to some extent invalidate the test. The general rule is to avoid
having theoretical frequencies less than 5. This point is discussed in greater detail in
the following chapter on tests of goodness of fit and independence with small samples.

(2) The theoretical frequencies are determined from the total frequency and the mean
and standard deviation of the sample, so we must deduct one degree of freedom for
each. Thus three degrees of freedom are absorbed in fitting, and since there are 13
classes we have 10 degrees of freedom for the estimation of χ².

In the present example we enter the χ² table therefore under n = 10, and note
that a χ² of 7.9377 corresponds approximately to a P value of 0.66. Consequently
the fit may be considered a very good one.
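The column of Table 21 can be recomputed as a check. The sketch below also evaluates the probability directly, using the closed form of the χ² tail that holds when the number of degrees of freedom is even; this is an assumption of the illustration, and a value computed this way may differ a little from one read by interpolation in a table.

```python
import math

def chi_square(actual, theoretical):
    """Sum of (a - t)^2 / t over all classes."""
    return sum((a - t) ** 2 / t for a, t in zip(actual, theoretical))

def chi2_tail_even_df(chi2, df):
    """P(chi-square >= chi2) for an even number of degrees of freedom.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = chi2 / 2.0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term
        term *= half / (k + 1)
    return math.exp(-half) * total

actual = [4, 11, 17, 29, 43, 56, 58, 63, 61, 25, 20, 9, 4]
normal = [4.64, 7.92, 16.84, 30.28, 44.76, 59.16, 64.96,
          60.40, 47.56, 31.16, 18.24, 8.80, 5.28]

chi2 = chi_square(actual, normal)
# One df each for the total, the mean, and the standard deviation.
df = len(actual) - 3
p = chi2_tail_even_df(chi2, df)
print(round(chi2, 4), df, round(p, 2))
```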

3. Tests of Independence and Association. From a cross of two
wheat varieties 82 strains were developed and tested for their agronomic
characters. One set of data for these strains is given in Table 22.

TABLE 22

Classification of 82 Strains of Wheat for
Yield and Character of Awns

                    Yield Classes, weight in grams
                 151-200    201-250    251-325    Total
    Awned            6          7         21        34
    Awnless         18         21          9        48
    Total           24         28         30        82

On examining the frequencies in the 3 × 2 table, we note that there seems
to be a tendency for the awned types to give higher yields than the
awnless ones. To test the significance of such a result, we have to
determine the probability of its occurrence if the two characters are
entirely independent. For this particular problem we have to find the
percentage of cases in which the above distribution, or one emphasizing
still more the difference in yield of the two classes of varieties, would be
obtained if there were no tendency whatever for awned varieties to
yield higher or lower than awnless ones. Such a test could be applied
by calculating χ² if we could obtain the theoretical frequencies for each
cell representing complete independence of the two characters. A
reasonable basis for the calculation of these theoretical frequencies is to
assume that, if the distributions are independent, they will be distributed
within the table in the same proportion as they are in the totals. Thus
in the cell in Table 22 containing 6 strains, we should have, on the basis
of complete independence, x strains where x : 24 :: 34 : 82. Hence
x = (24 × 34)/82. In the cell below, x = (24 × 48)/82. In the same
manner all the theoretical frequencies can be calculated, and then we
can proceed to the calculation of χ². This is the direct method of
calculating χ², but a shorter method for general use is given below under
Section 5.

4. Degrees of Freedom in χ² Tables. In goodness of fit tests where
the theoretical frequencies are determined according to some chosen
hypothesis, the degrees of freedom can usually be equated to (N − 1)
where N is the number of cells in the table. In certain cases, however,
as in Example 19 above, additional statistics calculated from the sample
are utilized to determine the theoretical frequencies, and one degree
of freedom must be subtracted for each of such statistics.

In tests of independence or association, the subtotals of the classes
into which the variates are distributed are used to determine the
theoretical frequencies, and obviously these must be treated as statistics,
so far as they themselves absorb degrees of freedom. Examining Table 22,
we note that originally we have 5 degrees of freedom in the table, but 1
of these is absorbed by the awning subtotals and 2 for the yield
subtotals. Therefore we have finally only 2 degrees of freedom left for the
estimation of χ². Another method of determining the degrees of
freedom is to make an actual count of the number of cells that can be filled
up arbitrarily. To do this we must assume that the subtotals are
chosen first. Then, as in Table 22, any two cells such as those
containing 6 and 7 may be filled up arbitrarily but all the rest are fixed. The
two cells that can be filled arbitrarily represent 2 degrees of freedom.

In m × n fold tables the degrees of freedom can be equated to
(m − 1)(n − 1) for the general case with which we are dealing. Special
cases will of course arise where this rule will not hold, but usually it is
easy in such cases to arrive at the correct number by some such method
as that described above.
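The direct method of Section 3 and the degrees-of-freedom rule of Section 4 can be put together in a short sketch. The function names are illustrative; the table is entered with the two rows Awned and Awnless, the middle-class frequencies being those implied by the row totals 34 and 48.

```python
def expected_frequencies(table):
    """Theoretical frequencies in proportion to the row and column subtotals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi-square, degrees of freedom) for an m x n table."""
    expected = expected_frequencies(table)
    chi2 = sum((a - t) ** 2 / t
               for row, exp_row in zip(table, expected)
               for a, t in zip(row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Table 22: rows are Awned and Awnless, columns are the three yield classes.
table22 = [[6, 7, 21],
           [18, 21, 9]]

expected = expected_frequencies(table22)
chi2, df = chi_square_independence(table22)
print(round(expected[0][0], 2), round(chi2, 2), df)
```

The first expected frequency is the (24 × 34)/82 of the text, and the χ² obtained can be compared with the value quoted for this table in the exercises at the end of the chapter.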

5. Methods of Calculation for Independence and Association Tests.
(a) For (m × n) fold tables. The generalized table may be
represented as follows:

                              C
                 1       2       3     ...     n      Total
           1     11      12      13    ...     1n     T_b1
      B    2     21      22      23    ...     2n     T_b2
          ...
           m     m1      m2      m3    ...     mn     T_bm
         Total  T_c1    T_c2    T_c3   ...    T_cn     T

In order to determine χ² we must calculate the theoretical frequency for
each cell. For cell 11 we find $t = (T_{c1} \cdot T_{b1})/T$, and for cell 12,
$t = (T_{c2} \cdot T_{b1})/T$, and so forth for all the cells. We then set up the
theoretical frequencies with the corresponding actual frequencies and
calculate $\chi^2 = \sum [(a - t)^2/t]$.

(b) For (2 × n) fold tables. A table of this type may be represented
as follows:

                 1       2       3     ...     n      Total
                b_1     b_2     b_3    ...    b_n      T_b
                c_1     c_2     c_3    ...    c_n      T_c
         Total  T_1     T_2     T_3    ...    T_n      T

We can calculate χ² for this table in exactly the same manner as for the
(m × n) fold table above, but a short-cut method giving χ² directly
without calculating the theoretical frequencies is given by Brandt and
Snedecor, as follows:

$$\chi^2 = \left[ \sum \frac{b_i^2}{T_i} - \frac{T_b^2}{T} \right] \frac{T^2}{T_b T_c} \qquad (2)$$

Each frequency in either of the rows is squared and divided by the
corresponding subtotal. These are summated and the correction term
subtracted as shown in the formula. The remainder is multiplied by
the quotient of the square of the total frequency by the product of the
two subtotals on the right. This formula shows, as each value of
$b_i^2/T_i$ is calculated, the contribution of each pair to the value of χ².
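As a check on the verbal description of the short-cut method, the sketch below applies it to the two rows of Table 22 and compares the result with the direct calculation from theoretical frequencies. The function names are illustrative, and the middle-class frequencies 7 and 21 are those implied by the subtotals of Table 22.

```python
def brandt_snedecor(row_b, row_c):
    """Short-cut chi-square for a 2 x n table, following the verbal description."""
    t_b, t_c = sum(row_b), sum(row_c)
    total = t_b + t_c
    column_totals = [b + c for b, c in zip(row_b, row_c)]
    # Each frequency of one row squared and divided by its column subtotal.
    summed = sum(b * b / t for b, t in zip(row_b, column_totals))
    correction = t_b * t_b / total
    return (summed - correction) * total * total / (t_b * t_c)

def direct_chi_square(row_b, row_c):
    """Chi-square from the theoretical frequencies of every cell."""
    t_b, t_c = sum(row_b), sum(row_c)
    total = t_b + t_c
    chi2 = 0.0
    for b, c in zip(row_b, row_c):
        column_total = b + c
        for actual, row_total in ((b, t_b), (c, t_c)):
            theoretical = row_total * column_total / total
            chi2 += (actual - theoretical) ** 2 / theoretical
    return chi2

awned, awnless = [6, 7, 21], [18, 21, 9]
short_cut = brandt_snedecor(awned, awnless)
direct = direct_chi_square(awned, awnless)
print(round(short_cut, 2), round(direct, 2))
```

Either row may be used in the short-cut formula, since interchanging the rows leaves the result unchanged.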

(c) For (2 × 2) fold tables. Representing the (2 × 2) fold table
as follows:

                b_1     b_2     T_b
                c_1     c_2     T_c
                T_1     T_2     T

χ² is given by

$$\chi^2 = \frac{(b_1 c_2 - c_1 b_2)^2 \, T}{T_1 T_2 T_b T_c} \qquad (3)$$

We multiply diagonal frequencies and find the difference between the
two products. The difference is squared and multiplied by the grand
total, and the result is divided by the product of the subtotals.

6. Coefficient of Contingency. It will have been noted that the
methods employed in tests of independence and association are
comparable to the method of correlation, with this essential difference, that
in the former the categories are either descriptive or numerical. If the
categories are numerical and of equal magnitude, we can calculate a
correlation coefficient for any of the tables to which we usually apply
χ², with the reservation that if the categories are very broad we will get
only an approximation to the true value of the correlation coefficient
even if corrections are made for grouping. The necessity for the use
of χ² arises, therefore, from material which can be classified, at least
for one character, only in descriptive categories, or in numerical
categories that are not of equal magnitude. For tables to which only χ²
methods can be applied, some investigators feel that in addition to the
χ² test, which is essentially a test of significance, they should have some
measure of association comparable to the correlation coefficient. A
measure of this type is Pearson's coefficient of contingency (C) given by:

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}$$

where N is the total number of observations (not the number of classes).

Since it is a function of χ², the significance of the coefficient of
contingency must be the same as for χ². It is not necessary, therefore, to
have a standard error of C in order to test its significance.
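Both the (2 × 2) short-cut formula (3) and the coefficient of contingency are simple enough to sketch together. The 2 × 2 frequencies used here are hypothetical, chosen only so that the result is easy to verify by the direct method; the coefficient is then evaluated for the χ² of 15.87 and the 82 strains of Table 22. The usual form C = √(χ²/(N + χ²)) is assumed.

```python
import math

def chi_square_2x2(b1, b2, c1, c2):
    """Short-cut chi-square for a 2 x 2 table from the diagonal products."""
    t_b, t_c = b1 + b2, c1 + c2          # row subtotals
    t_1, t_2 = b1 + c1, b2 + c2          # column subtotals
    total = t_b + t_c
    return (b1 * c2 - c1 * b2) ** 2 * total / (t_1 * t_2 * t_b * t_c)

def contingency_coefficient(chi2, n_obs):
    """Pearson's coefficient of contingency, assuming C = sqrt(chi2 / (N + chi2))."""
    return math.sqrt(chi2 / (n_obs + chi2))

# Hypothetical 2 x 2 frequencies: every theoretical frequency is 15,
# so the direct method gives 4 * 5^2 / 15.
chi2_example = chi_square_2x2(20, 10, 10, 20)

# Coefficient of contingency for Table 22 (chi-square 15.87, 82 strains).
c_table22 = contingency_coefficient(15.87, 82)
print(round(chi2_example, 3), round(c_table22, 3))
```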

7. Exercises.

1. Test the goodness of fit of observation to theory for the following ratios:

              Observed Values       Theoretical
               A         a           A       a
    (1)       134        36          3       1
    (2)       240       120          3       1
    (3)        76        56          1       1
    (4)       240        13         15       1

The values you should obtain are: (1) 1.32
                                  (2) 13.33
                                  (3) 3.03
                                  (4) 0.53

2. In an F₂ family of 200 plants segregating for resistance to rust, if resistance is
dominant and susceptibility recessive, find the ratio that gives a P value of exactly
0.05 when fitted to a 3 : 1 ratio.

There are two possibilities, the ratios being 138 : 62 or 162 : 38.

3. In a certain cross the types represented by BC, Bc, bC, and bc are expected to
occur in a 9 : 3 : 3 : 1 ratio. The actual frequencies obtained were:

      BC      Bc      bC      bc
     102      16      35       7

Determine the goodness of fit, and if the fit is poor analyze the data further to
disclose the source of the discrepancy.

χ² = 9.86; P is less than 0.05. Hence the fit is poor.

In further analysis, test the segregation for each factor separately.

4. Test the goodness of fit of the actual to the theoretical normal frequencies for
either of the distributions from Chapter II, Exercise 2, or Chapter II, Exercise 3.
Watch the grouping of the classes at the tails of the distributions in order that the
theoretical frequency in any one class is not less than 5.

For Exercise 2, χ² = approximately 10.

For Exercise 3, χ² = approximately 2.6.

5. Table 23 gives the data obtained during an epidemic of cholera (3) on the
effectiveness of inoculation as a means of preventing the disease. Test the hypothesis
that in the inoculated group the number of persons attacked is not significantly less
than in the not inoculated group, and the number not attacked is not significantly
greater. Note carefully how this hypothesis is worded.

TABLE 23

Frequencies of Attacked and Not Attacked
in Inoculated and Not Inoculated Groups

                        Not attacked    Attacked
    Inoculated              192
    Not inoculated          113

6. Calculate χ² and locate the approximate P value for Table 22 given in Section
3 above. χ² = 15.87.

7. The data in Table 24 were obtained in a cross between a rust-resistant and a
susceptible variety of oats. The F₃ families were compared for reaction to rust in
the seedling stage, and in the field under ordinary epidemic conditions.

TABLE 24

Classification of Seedling and Field Reactions
of 810 F₃ Families of Oats

                                         Seedling Reaction
                                Resistant   Segregating   Susceptible
                    Resistant
    Field Reaction  Segregating
                    Susceptible

Test the significance of the association in this table, and calculate the coefficient of
contingency.

χ² = 1127.87. (This result will vary according to the accuracy with which
the t values are calculated. To check approximately with the value given here
calculate the t values to at least two decimal figures.)


CHAPTER X 


TESTS OF GOODNESS OF FIT AND INDEPENDENCE WITH 

SMALL SAMPLES 

1. Inadequacy of the χ² Criterion and the Correction for Continuity.
The method of χ² is based on the smooth curve of a continuous
distribution and, when the numbers are large, gives probability results
that are very close to the true values. When the numbers are small, and
especially when only one degree of freedom is involved, the χ² method is
quite inaccurate. One reason for this will be clear from an examination
of Fig. 10, representing the distribution obtained by expanding the
binomial (½ + ½)⁸. Given a theoretical ratio of 1 : 1, say, for the
success or failure of an event, the binomial distribution as in Fig. 10
would give the theoretical frequency of the successes through the total
range from 0 to 8. If we wished to determine the probability of obtaining
6 or more successes in 1 trial of 8 events, we would find the ratio of the
dotted area of the figure to that of the whole. A χ² test of the 6 : 2
ratio, however, would be based on the smooth curve shown in Fig. 10,
and the probability would be the ratio of the cross-hatched area to the
whole. The cross-hatched area is obviously less than the dotted area,
by an amount equal approximately to one-half the area of the 6 : 2 ratio
column. Consequently the χ² test will give a probability result that is
too low.

Fig. 10. Frequency distribution of (½ + ½)⁸ and corresponding smooth curve.
Shaded areas indicate the need for a correction to χ² for small samples.

In order to correct for the above-mentioned discrepancy in the χ²
test, Yates (8) has suggested a correction which he proposes to call the
correction for continuity. In the ordinary case χ² is given by
$\sum [(a - t)^2/t]$, where a represents the actual and t the theoretical
frequencies. Yates's correction is applied by subtracting ½ from each
value of (a − t), but it must always be subtracted in the direction that
reduces the numerical value of (a − t). In Fig. 10 the application of the
correction would result in extending the cross-hatched area to the line
bordering the columns representing the 5 : 3 and 6 : 2 ratios, and must
obviously bring about an improvement in the estimate of probability.
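A minimal sketch of the corrected statistic follows. The function name and the `yates` flag are illustrative, and the guard that prevents a deviation smaller than ½ from changing sign is an assumption of this sketch rather than something stated in the text.

```python
def chi_square(actual, theoretical, yates=False):
    """Chi-square, optionally reducing each |a - t| by 1/2 before squaring."""
    total = 0.0
    for a, t in zip(actual, theoretical):
        deviation = abs(a - t)
        if yates:
            # Always move the deviation toward zero; never past it (assumed guard).
            deviation = max(deviation - 0.5, 0.0)
        total += deviation * deviation / t
    return total

# The 6 : 2 ratio tested against a theoretical 4 : 4.
uncorrected = chi_square([6, 2], [4, 4])
corrected = chi_square([6, 2], [4, 4], yates=True)
print(uncorrected, corrected)
```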

It should be noted in connection with tests of significance applied
to ratios that the χ² method is exactly equivalent to the use of the
standard deviation to determine the significance of a deviation from
the mean. Likewise the correction for continuity must be made when
the numbers are small. As will be evident from Fig. 10, the correction
is simply a matter of subtracting ½ from the deviation from the mean.
To test the significance of a 6 : 2 ratio when the theoretical is 1 : 1 or
4 : 4, we would take the deviation equal to (6 − 4 − ½) = 1.5. The
standard deviation of a binomial distribution is
$\sqrt{pqn} = \sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times 8} = 1.4142$,
and we can test in the usual way, using tables of the probability
integral.
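The effect of the correction on the 6 : 2 example can be seen by comparing the exact binomial probability of 6 or more successes in 8 events with the two normal approximations. This is a sketch with illustrative names; the probability integral is evaluated with the complementary error function instead of a table.

```python
import math

def binomial_upper_tail(k, n, p):
    """Exact probability of k or more successes in n events."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def normal_upper_tail(z):
    """Area of the normal curve beyond a deviate z (one tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

n, p, k = 8, 0.5, 6
mean = n * p
sd = math.sqrt(n * p * (1 - p))                       # 1.4142 for this case

exact = binomial_upper_tail(k, n, p)                  # 37/256
uncorrected = normal_upper_tail((k - mean) / sd)
corrected = normal_upper_tail((k - mean - 0.5) / sd)  # deviation reduced by 1/2
print(round(exact, 4), round(uncorrected, 4), round(corrected, 4))
```

The uncorrected figure falls well short of the exact probability, while the corrected one lies close to it.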

The χ² test for ratios is also inaccurate when applied to samples from
populations having a definitely skewed distribution. In the case of
ratios of successes to failures where the theoretical ratio is not 1 : 1,
this inadequacy of the χ² test becomes obvious. Table 25 gives the true
probabilities calculated from the binomial distribution of obtaining from
16 to 0 successes when each trial consists of 16 events. These are worked
out for two cases: (1) when the theoretical ratio is 1 : 1, and (2) when the
theoretical ratio is 3 : 1. The corresponding values obtained by
calculating χ² with and without Yates's correction are given in the same
table.¹ For the symmetrical binomial distribution it will be noted that
the values for χ'² with Yates's correction agree very well with the
correct values except at the extreme tails of the distribution where χ'²
tends to overestimate the probability. For the asymmetrical distribution
the agreement is not good anywhere in the range. In both cases it
will be observed that uncorrected χ² gives a very decided underestimate
of the probability through practically the whole range.

¹ ½P is used here to indicate that the probability is calculated from the area of
only one tail of the distribution. As the problem is stated in terms of "15 or more
successes," etc., it is obvious that only one tail of the distribution must be considered.

TABLE 25

Probability of n Successes in a Sample of 16 Events

              Distribution (½ + ½)^16                Distribution (¾ + ¼)^16
                         Corrected   Uncorrected                Corrected   Uncorrected
Successes    ½P (Bin)¹   ½P (χ′²)    ½P (χ²)        ½P (Bin)    ½P (χ′²)    ½P (χ²)

   16        0.000,015   0.000,088   0.000,032      0.010,023   0.021,656   [illegible]
   15        0.000,259   0.000,577   0.000,233      0.063,477   0.074,457   [illegible]
   14        0.002,090   0.002,980   0.001,350      0.197,112   0.193,248   [illegible]
   13        0.010,635   0.012,224   0.006,210      0.404,988   0.386,406   0.281,837
   12        0.038,406   0.040,059   0.022,750
   11        0.105,056   0.105,650   0.066,807      0.369,812   0.386,406   0.281,837
   10        0.227,248   0.226,627   0.158,655      0.189,653   0.193,248   0.124,109
    9        0.401,809   0.401,294   0.308,538      0.079,556   0.074,457   0.041,638
    8                                               0.027,129   0.021,656   0.010,461
    7        0.401,809   0.401,294   0.308,538      0.007,469   0.004,687   0.001,946
    6        0.227,248   0.226,627   0.158,655      0.001,644   0.000,748   0.000,266
    5        0.105,056   0.105,650   0.066,807      0.000,285   0.000,087   0.000,027
    4        0.038,406   0.040,059   0.022,750      0.000,038   0.000,008   [illegible]
    3        0.010,635   0.012,224   0.006,210      0.000,004   0.000,001   [illegible]
    2        0.002,090   0.002,980   0.001,350      0.000,000   0.000,000   0.000,000
    1        0.000,259   0.000,577   0.000,233      0.000,000   [illegible] 0.000,000
    0        0.000,015   0.000,088   0.000,032      0.000,000   [illegible] 0.000,000

The modal class of each distribution (8 successes for the 1 : 1 ratio,
12 successes for the 3 : 1 ratio) is left blank.
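
Any line of Table 25 on the upper side of the mean can be reproduced with a few lines of code. The following is only a sketch: the function names are ours, and the normal tail is computed from the complementary error function rather than read from tables, so the last figure may differ slightly from the printed values.

```python
import math

def one_tail_normal(z):
    """Area of the normal curve beyond the deviate z (one tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def binomial_upper_tail(n, p, k):
    """Exact probability of k or more successes in n events."""
    return sum(math.comb(n, r) * p ** r * (1 - p) ** (n - r)
               for r in range(k, n + 1))

def table_row(n, p, k):
    """(exact, corrected, uncorrected) one-tail probabilities for k above the mean n*p."""
    sd = math.sqrt(n * p * (1 - p))
    deviation = k - n * p
    exact = binomial_upper_tail(n, p, k)
    corrected = one_tail_normal((deviation - 0.5) / sd)   # Yates's correction
    uncorrected = one_tail_normal(deviation / sd)
    return exact, corrected, uncorrected

for p in (0.5, 0.75):
    for k in (16, 15, 14, 13):
        print(p, k, ["%.6f" % value for value in table_row(16, p, k)])
```

For 12 successes with the 1 : 1 ratio the sketch returns approximately 0.038406, 0.040059 and 0.022750, the figures of the corresponding line of the table.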


In probability tests applied to 2 × 2 frequency tables, the same
difficulties arise with regard to the application of $\chi^2$ as for
testing the goodness of fit of simple ratios. Since only one degree of
freedom is involved, the number of possible combinations of the
frequencies of unlike probability is relatively small and the theoretical
distribution is, therefore, definitely discontinuous. The error is not
significant when the frequencies are large, but with small frequencies it
is very decided. The skewness factor is not so important for 2 × 2 tables
as for simple ratios, as the $\chi^2$ curve adapts itself within certain
limits to the shape of the theoretical distribution. After correction for
continuity the remaining discrepancy may be regarded as due to the
comparison between a histogram and a smooth curve which gives an
approximate fit.

The method of making the correction for continuity is to determine
the larger of the two products $b_1 c_2$ and $b_2 c_1$, and for the larger
subtracting 0.5 from the two factors, and for the smaller adding 0.5 to
the two factors. After making these corrections the usual formula may be
applied.
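
The rule may be sketched in code as follows. The function name is hypothetical, the "usual formula" is taken to be $(b_1 c_2 - b_2 c_1)^2 T / (T_1 T_2 T_b T_c)$, and the sketch does not guard against the case in which the two products differ by less than half the total.

```python
import math

def corrected_chi2_by_cells(b1, b2, c1, c2):
    """Adjust the four cells by 0.5 as described, then apply the usual formula."""
    if b1 * c2 >= b2 * c1:
        # b1*c2 is the larger product: reduce its factors, raise the others
        b1, c2, b2, c1 = b1 - 0.5, c2 - 0.5, b2 + 0.5, c1 + 0.5
    else:
        b2, c1, b1, c2 = b2 - 0.5, c1 - 0.5, b1 + 0.5, c2 + 0.5
    t1, t2 = b1 + c1, b2 + c2          # column totals (unchanged by the adjustment)
    tb, tc = b1 + b2, c1 + c2          # row totals (unchanged by the adjustment)
    total = tb + tc
    return (b1 * c2 - b2 * c1) ** 2 * total / (t1 * t2 * tb * tc)

chi2 = corrected_chi2_by_cells(12, 0, 6, 8)
half_p = 0.5 * math.erfc(math.sqrt(chi2 / 2))   # one tail of the normal curve
print(round(chi2, 4), round(half_p, 5))
```

For the table with cells 12, 0, 6, 8 the adjusted products differ by 83, which is the same as 96 − 26/2, so that the adjustment of the cells and the subtraction of T/2 from the difference of the products are seen to be one and the same correction.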

Table 26 has been prepared to show the relation between the values
of P calculated for the 2 × 2 table:

        12     0
         6     8

using (a) a direct method for determining the exact probability, (b)
$\chi^2$ without correction, and (c) $\chi'^2$, or that obtained by using
the correction for continuity. The direct method was devised by R. A.
Fisher (1) and will be described below under "Methods of Calculation."
The probability value for the modal frequency has been omitted since it
may be considered as belonging to either tail of the distribution.

It will be noted that at the extreme tails of the distribution $\chi'^2$
tends to overestimate the probability, but that in the range where
significance may be in doubt the agreement is fairly good. On the other
hand, as indicated by the ½P values for $\chi^2$, unless the correction
for continuity is made there is a very decided underestimation of the
probability throughout the whole range.

For 2 × 3 frequency tables, the correction for continuity is not so
important as for 2 × 2 tables. With 2 degrees of freedom the number
of possible combinations is much greater than for 1 degree of freedom,
and the agreement between the smooth curve and the histogram must be
much better. With more than 2 degrees of freedom the correction for
continuity would hardly be necessary in any case. It must be remembered,
however, that the tendency, especially when the numbers are small, is to
underestimate the probability; and it may be necessary in certain cases
to check the probability by direct calculation, or if this is
impractical, by an analytical study of the larger table made by breaking
it up into parts or condensing it into a single 2 × 2 table. The direct
calculation of probabilities, even in a 2 × 3 table, is slightly
complicated; so that in most cases the best practice is to endeavor to
make an application of $\chi^2$ such that we are reasonably sure of a
fair approximation to the true probability.
TABLE 26

Probabilities for All the Combinations of a 2 × 2 Table

                          P Calculated by
Combination      Direct Method      χ²          χ′²

  12    0
   6    8           0.00192       0.00082     0.00325

  11    1
   7    7           0.02828       0.01087     0.03084

  10    2
   8    6           0.15515       0.07460     0.15475

   9    3
   9    5           0.43707       0.27756     0.49346

   8    4
  10    4           (modal combination, omitted)

   7    5
  11    3           0.24577       0.13251     0.24557

   6    6
  12    2           0.06124       0.02459     0.06188

   5    7
  13    1           0.00741       0.00241     0.00835

   4    8
  14    0           0.00032       0.00012     0.00059
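
The "Direct Method" column of Table 26 can be rebuilt by summing the probabilities of the separate combinations from each end of the series. The sketch below assumes the marginal totals of the table (12 and 14 for the rows, 18 and 8 for the columns); the function names are ours.

```python
import math

def table_probability(b1, b2, c1, c2):
    """Direct probability of one 2 x 2 combination with fixed marginal totals."""
    f = math.factorial
    t1, t2 = b1 + c1, b2 + c2          # column totals
    tb, tc = b1 + b2, c1 + c2          # row totals
    total = tb + tc
    numerator = f(t1) * f(t2) * f(tb) * f(tc)
    denominator = f(total) * f(b1) * f(b2) * f(c1) * f(c2)
    return numerator / denominator

def combination(b1):
    """Cells (b1, b2, c1, c2) when the margins are 12, 14 and 18, 8."""
    return b1, 12 - b1, 18 - b1, b1 - 4

upper, running = {}, 0.0
for b1 in range(12, 8, -1):            # 12, 11, 10, 9: work in from the upper end
    running += table_probability(*combination(b1))
    upper[b1] = running

lower, running = {}, 0.0
for b1 in range(4, 8):                 # 4, 5, 6, 7: work in from the lower end
    running += table_probability(*combination(b1))
    lower[b1] = running

print({k: round(v, 5) for k, v in upper.items()})
print({k: round(v, 5) for k, v in lower.items()})
```

Each entry is thus the probability of the combination shown or one more extreme in the same direction, which is why the modal combination cannot be assigned to either tail.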


2. Methods of Calculation. Example 20. In a study of the blood groups of
some North American Indians, Grant (2) obtained the results given in the
following table:

                          Blood Groups
Band of Indians       O      A      B      AB     Total

Fond du lac          18      6      5      0        29
Chipewyan            13      0      1      0        14

Total                31      6      6      0        43

It appears that pure Indians tend towards a very high percentage of
individuals having the blood group O, but the group at Fond du lac had an
obviously larger percentage of white blood as indicated by other
characteristics. The essential problem in this case is to test the
significance of the distribution of the two bands into two main groups,
O and not O. We form, therefore, a 2 × 2 table, as below:

                      O      not O     Total

Fond du lac          18        11        29
Chipewyan            13         1        14

Total                31        12        43

Either the $\chi^2$ test with the correction for continuity or the direct
probability method would be applicable to this table. In order to
indicate the methods of calculation we shall apply the test in both ways.

(a) $\chi^2$ corrected for continuity. If a 2 × 2 table is represented as
follows:

        b_1    b_2    T_b
        c_1    c_2    T_c
        T_1    T_2    T

the corrected value of $\chi^2$ is given by

$$\chi'^2 = \frac{\left(b_1 c_2 - c_1 b_2 - \frac{T}{2}\right)^2 T}{T_1\, T_2\, T_b\, T_c}$$

where T/2 always reduces the numerical value of $(b_1 c_2 - c_1 b_2)$.
This is of course equivalent to the method described on page 103.

Applying the corrected formula to our example, we have

$$\chi'^2 = \frac{\left(13 \times 11 - 18 - \frac{43}{2}\right)^2 \times 43}{31 \times 12 \times 14 \times 29} = 3.0499$$

Using Yule's table of "P for divergence from independence in the fourfold
table" (9), we look up

        χ² = 3.0        P = 0.08326
        χ² = 3.1        P = 0.07829

        Difference          0.00497

and by direct interpolation P = 0.08077 and ½P = 0.0404.

In order to obtain P more accurately we can make use of the fact that the
distribution of $\chi$ is normal for one degree of freedom, and
$\sqrt{\chi^2} = t$ is the value for entering tables of the probability
integral. Here $t = \sqrt{3.0499} = 1.7464$, and in Sheppard's table of
the probability integral we look up

        t = 1.74        ½(1 + α) = 0.9590705
        t = 1.75        ½(1 + α) = 0.9599408

        Difference                 0.0008703

and interpolating directly for t = 1.7464 we have ½(1 + α) = 0.959,6275.
Since we want ½P we take ½P = 1 − ½(1 + α) = 0.04037.
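
The calculation of part (a) may be checked as follows. This is a sketch only: the function name is ours, the absolute value stands for the rule that T/2 must reduce the numerical value of the difference, and the complementary error function replaces the interpolation in Sheppard's table.

```python
import math

def corrected_chi2(b1, b2, c1, c2):
    """Chi-square for a 2 x 2 table with the correction for continuity."""
    t1, t2 = b1 + c1, b2 + c2          # column totals
    tb, tc = b1 + b2, c1 + c2          # row totals
    total = tb + tc
    difference = abs(b1 * c2 - c1 * b2) - total / 2
    return difference ** 2 * total / (t1 * t2 * tb * tc)

chi2 = corrected_chi2(18, 11, 13, 1)   # Fond du lac: 18, 11; Chipewyan: 13, 1
t = math.sqrt(chi2)                    # normal deviate for one degree of freedom
half_p = 0.5 * math.erfc(t / math.sqrt(2))
print(round(chi2, 4), round(t, 4), round(half_p, 5))
```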

(b) Direct probability method for a 2 × 2 table. Representing a 2 × 2
table as above, R. A. Fisher (1) has shown that, for any particular
combination of $b_1$, $b_2$, $c_1$, $c_2$, the direct probability of its
occurrence is given by

$$\frac{T_1!\; T_2!\; T_b!\; T_c!}{T!} \times \frac{1}{b_1!\; b_2!\; c_1!\; c_2!}$$

The easiest method of performing the calculations is by means of a table
of logarithms of factorials. The different combinations that can occur
are as follows:

        18   11          17   12
        13    1          14    0          and so forth

all other combinations having the same probability and occurring with
equal frequency with one of the above. In this case, therefore, we
require the sum of the separate probabilities of the first two
combinations. These are given by:

$$\left[\frac{31! \times 12! \times 29! \times 14!}{43!} \times \frac{1}{18! \times 11! \times 13! \times 1!}\right]$$

$$\left[\frac{31! \times 12! \times 29! \times 14!}{43!} \times \frac{1}{17! \times 12! \times 14! \times 0!}\right]$$

When a series of such terms is to be calculated, labor is saved by first
calculating the logarithm of the constant factor. The logarithms of the
terms are then obtained by subtracting the logarithms of the factorials
in the denominator of the second factor of each term.

In this example, log constant factor = 31.701,1593

The logs to be subtracted are 33.201,7770 and 34.171,8139, giving:

        log term 1 = 2̄.499,3823        Term 1 = 0.031,578
        log term 2 = 3̄.529,3454        Term 2 = 0.003,383

                                       Total = ½P = 0.0349

The values of ½P obtained by the two methods are in fairly close
agreement.¹

¹ The student may use this example in order to straighten out in his mind
the reason why for certain tests it is correct to base the decision on
the value of ½P instead of P. Actually the hypothesis being tested here
is that Indians having an admixture of white blood do not contain a
greater percentage of individuals with the blood group O than Indians
that are relatively pure. If the hypothesis is stated differently (for
example, that the two groups of Indians are random samples drawn from the
same population with respect to the distribution of the blood group O),
then it would be necessary to use the full value of P in order to make
the test. The test based on the value of ½P arises from the knowledge
that the Fond du lac group had an obviously larger percentage of white
blood than the Chipewyans.
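
The logarithmic procedure of Example 20 (b) may be sketched as below. The helper names are ours, and the logarithms of the factorials are obtained from the log-gamma function instead of a printed table, so the last decimal places may not agree exactly with the figures in the text.

```python
import math

def log10_factorial(n):
    """Common logarithm of n!, from the log-gamma function."""
    return math.lgamma(n + 1) / math.log(10)

def log10_term(b1, b2, c1, c2):
    """log10 of the direct probability of one 2 x 2 combination."""
    t1, t2 = b1 + c1, b2 + c2
    tb, tc = b1 + b2, c1 + c2
    total = tb + tc
    # logarithm of the constant factor (marginal totals only)
    constant = sum(log10_factorial(m) for m in (t1, t2, tb, tc)) - log10_factorial(total)
    # logarithms to be subtracted (the four cell frequencies)
    cells = sum(log10_factorial(m) for m in (b1, b2, c1, c2))
    return constant - cells

terms = [10 ** log10_term(*cells) for cells in ((18, 11, 13, 1), (17, 12, 14, 0))]
half_p = sum(terms)
print([round(term, 6) for term in terms], round(half_p, 4))
```

Because the marginal totals are the same for every combination, the constant factor need be evaluated only once when many terms are required; the sketch recomputes it for each term merely to keep the function self-contained.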
Example 21. For a certain disease we will assume that it has been shown
that recovery or death is a certainty and that without treatment about
half of the patients recover. A new treatment tried out in 16 cases gives
12 recoveries and 4 deaths. Is this a significant demonstration of the
efficacy of the treatment?

This problem can be solved by the direct calculation of probabilities
according to the binomial distribution, or since the theoretical
distribution is symmetrical the $\chi^2$ test corrected for continuity
will give a fairly close approximation. Both methods will be used in
order to demonstrate methods of calculation.

(a) $\chi^2$ corrected for continuity. For ratios the short formula for
determining $\chi^2$, as in Chapter IX, Section 2, is modified as follows
to correct for continuity:

$$\chi'^2 = \frac{\left(a_1 - a_2 x - \frac{x + 1}{2}\right)^2}{xN} \qquad (3)$$

where the theoretical ratio is x : 1, $a_1$ is the actual frequency
corresponding to x, and $a_2$ is the actual frequency corresponding to 1.
N is the total frequency or $(a_1 + a_2)$, and $\frac{x + 1}{2}$ always
reduces the numerical value of $(a_1 - a_2 x)$.

In the present example:

$$\chi'^2 = \frac{(12 - 4 - 1)^2}{16} = \frac{49}{16} = 3.0625$$

From Yule's table of P we find ½P = 0.0401. The odds are about 25 : 1
against the occurrence of a 12 : 4 ratio due to chance alone.
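
Formula (3) can be compared with the longer form $\sum (|a - t| - \tfrac{1}{2})^2 / t$ in a short sketch. Both function names are hypothetical, and the 14 : 2 case with a 3 : 1 ratio is included only to show that the two forms agree when x is not 1.

```python
import math

def corrected_chi2_ratio(a1, a2, x):
    """Short formula (3) for a theoretical ratio x : 1."""
    n = a1 + a2
    difference = abs(a1 - a2 * x) - (x + 1) / 2
    return difference ** 2 / (x * n)

def corrected_chi2_long(a1, a2, x):
    """Sum over both classes of (|actual - theoretical| - 1/2)^2 / theoretical."""
    n = a1 + a2
    t1, t2 = n * x / (x + 1), n / (x + 1)
    return (abs(a1 - t1) - 0.5) ** 2 / t1 + (abs(a2 - t2) - 0.5) ** 2 / t2

chi2 = corrected_chi2_ratio(12, 4, 1)
half_p = 0.5 * math.erfc(math.sqrt(chi2 / 2))   # one tail, one degree of freedom
print(chi2, round(half_p, 4))
print(corrected_chi2_ratio(14, 2, 3), corrected_chi2_long(14, 2, 3))
```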

(b) Direct probability from the binomial. Let p represent the probability
of recovery and q the probability of death. We know that p = q = ½, and
we require the first five terms of the expansion of $(p + q)^n$ where
n = 16. The expansion of $(p + q)^n$ is given by:

$$(p + q)^n = p^n + {}_nC_1\, p^{n-1} q + {}_nC_2\, p^{n-2} q^2 + \cdots + {}_nC_n\, q^n \qquad (4)$$

where

$${}_nC_r = \frac{n(n - 1)(n - 2) \cdots (n - r + 1)}{1 \cdot 2 \cdot 3 \cdots r} = \frac{n!}{r!\,(n - r)!}$$

In our example we have:

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^{16} = \left(\tfrac{1}{2}\right)^{16} + \frac{16!}{1!\,15!}\left(\tfrac{1}{2}\right)^{16} + \frac{16!}{2!\,14!}\left(\tfrac{1}{2}\right)^{16} + \frac{16!}{3!\,13!}\left(\tfrac{1}{2}\right)^{16} + \frac{16!}{4!\,12!}\left(\tfrac{1}{2}\right)^{16} + \cdots$$

In each term we have the constant factor $(\tfrac{1}{2})^{16}$. We
determine the logarithm of this factor in the ordinary way and proceed to
determine the logarithms of the coefficients by means of a table of the
logarithms of factorials. The work is as shown in Table 27, which is
self-explanatory with the possible exception of the last column. The term
values give the probabilities of obtaining in one trial the number of
recoveries (or deaths) shown in the same line. In general, however, we do
not ask that question. We inquire, for example, as to the probability of
obtaining 12 or more recoveries in a sample of 16, and hence we must add
the probabilities for 12, 13, 14, 15, and 16 recoveries. These summations
have been performed and are given in the last column under the heading
½P. Again, since we have summated for one tail of the distribution only,
we represent the probability by ½P.

The answer to our problem is given in the line representing 12
recoveries. The corresponding value of ½P is 0.0384, and this compares
reasonably well with ½P = 0.0401, obtained by the $\chi^2$ method.
TABLE 27

Calculation of Probabilities from the Binomial (½ + ½)^16

Recoveries    Log nCr       Log (½)^16     Log Term       Term         ½P

    16                      5̄.183,5200     5̄.183,5200     0.000,015    0.000,015
    15        1.204,1200    5̄.183,5200     4̄.387,6400     0.000,244    0.000,259
    14        2.079,1812    5̄.183,5200     3̄.262,7012     0.001,831    0.002,090
    13        2.748,1880    5̄.183,5200     3̄.931,7080     0.008,545    0.010,635
    12        3.260,0714    5̄.183,5200     2̄.443,5914     0.027,771    0.038,406


Example 22. In the example above, let us assume that without treatment
the ratio of recoveries to deaths is 3 : 1 instead of 1 : 1, and in the
group of 16 patients receiving treatment the actual ratio is 14 : 2. Test
the significance of the treatment.

This problem differs from the first, in that the theoretical distribution
is skewed, and, remembering what has been said about the $\chi^2$ method,
it may be taken for granted that $\chi^2$ will not give a good
approximation to the true probability. We must solve this problem,
therefore, by a direct calculation of the probability from the binomial
distribution.

Since the ratio of recoveries to deaths is 3 : 1, p = ¾ and q = ¼, and we
must calculate the first three terms of the expansion of
$(\tfrac{3}{4} + \tfrac{1}{4})^{16}$. Using the formula given we have:

$$\left(\tfrac{3}{4}\right)^{16} + \frac{16!}{1!\,15!}\left(\tfrac{3}{4}\right)^{15}\left(\tfrac{1}{4}\right) + \frac{16!}{2!\,14!}\left(\tfrac{3}{4}\right)^{14}\left(\tfrac{1}{4}\right)^{2}$$

Noting for convenience in calculation that: 



The factor is constant, and when several terms are to be calculated this trans- 
formation results in a saving of labor. 

The calculations are given in Table 28. In the ½P column representing 14
recoveries we have ½P = 0.1971, or the odds are only about 5 : 1 that the
treatment is beneficial. This is an indication of a beneficial effect but
it cannot in any sense be considered a proof. It would be sufficient
evidence to warrant further investigation, but the practical aspect of
such a problem must not be lost sight of, in that the actual gain in
recoveries is very small and further investigation might best be directed
along the line of trials with other treatments.

TABLE 28

Calculation of Probabilities from the Binomial (¾ + ¼)^16

Recoveries    Log nCr       Log p^r q^(n−r)   Log Term       Term         ½P

    16                      2̄.000,9808        2̄.000,9808     0.010,023    0.010,023
    15        1.204,1200    3̄.523,8595        2̄.727,9795     0.053,454    0.063,477
    14        2.079,1812    3̄.046,7382        1̄.125,9194     0.133,635    0.197,112
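
The entries of Table 28 may be recomputed, and set beside the corrected $\chi^2$ approximation, with the following sketch. The helper names are ours; logarithms of factorials come from the log-gamma function rather than from a printed table.

```python
import math

LN10 = math.log(10)

def log10_factorial(n):
    """Common logarithm of n!, from the log-gamma function."""
    return math.lgamma(n + 1) / LN10

def log10_term(n, k, p):
    """log10 of the binomial term for k successes in n events."""
    log_coefficient = log10_factorial(n) - log10_factorial(k) - log10_factorial(n - k)
    return log_coefficient + k * math.log10(p) + (n - k) * math.log10(1 - p)

terms = {k: 10 ** log10_term(16, k, 0.75) for k in (16, 15, 14)}
half_p_direct = sum(terms.values())            # 14 or more recoveries

# corrected normal deviate for 14 recoveries against an expectation of 12
z = (14 - 12 - 0.5) / math.sqrt(16 * 0.75 * 0.25)
half_p_chi = 0.5 * math.erfc(z / math.sqrt(2))
print({k: round(v, 6) for k, v in terms.items()})
print(round(half_p_direct, 6), round(half_p_chi, 6))
```

The two values of ½P differ by about 0.004 here, and, as Table 25 shows, the disagreement is greater elsewhere in the range of this skew distribution.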

3. Selection of Method for Tests of Significance. Some confusion
may arise as to when to apply $\chi^2$ and when to apply the direct
method of calculating probabilities. Also, when applying $\chi^2$ the
question arises whether or not the correction should be applied. In
general these points can be made clear by the consideration of some
hypothetical examples.

Example 23. The following is a 2 × 4 fold table of frequencies:

              A      B      C      D

    I        28     46     83    126
    II       65     43     12      1

The numbers are large, and the theoretical frequencies in each cell are
large. The $\chi^2$ criterion may be applied to the whole table, and no
correction is required.

Example 24. If some of the numbers in a 2 × 4 fold table are small, as in
the table below, the table must be rearranged.

              A      B      C      D

    I        26     84      2      1
    II       94     18      1      4

Obviously the classification of the I and II frequencies into C and D is
meaningless, and the rearrangement is either a matter of adding these
frequencies to B or eliminating them altogether. Assuming that they can
be eliminated we have a 2 × 2 table:

              A      B

    I        26     84
    II       94     18

To this table it is perfectly legitimate to apply the $\chi^2$ test, and,
although the numbers are fairly large, the correction for continuity will
improve the results slightly. Obviously it would be very laborious to
make a direct calculation of the probability, so we would not even
consider the method in this case.
Example 25. We have a 2 × 2 table in which the numbers are small:

              A      B

    I         8      2
    II        4     12

For this case the direct method is the most accurate and is not difficult.

Example 26. Given a theoretical ratio of 1 : 1 for the occurrence of A
and B in a series of events, we obtain in 100 trials 60 A's and 40 B's.
What is the significance of this result?

The numbers are large so that the direct calculation of the probability
will be very cumbersome. Therefore, we use $\chi^2$ with the correction
for continuity, or we calculate the ratio of the deviation (also
corrected for continuity) to its standard deviation and get the
probability from tables of the probability integral. The correction for
continuity is not important, but it is bound to give a slight
improvement.

Example 27. In a test of the goodness of fit of a ratio, we have a very
skew distribution. For example, the theoretical ratio of successes to
failures is 15 : 1, and the actual results are 5 failures out of 160
events. The direct method is the only one that will give an accurate
probability result in this case, and we must calculate the last six terms
of the expansion of $(15/16 + 1/16)^{160}$. When the numbers are large,
the calculations are somewhat laborious, but in most cases it is
sufficient to determine whether the result is or is not significant; and
it will only be necessary in working from one end of the distribution to
calculate enough terms such that their sum (½P) is 0.05. If the observed
deviation is within that range it is not significant. If the deviations
in both directions are to be considered, we work from both ends of the
distribution until the sum of the terms at each end is equal to 0.025.
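
The labor of Example 27 is small when done by machine. The sketch below works in from the end of the distribution representing no failures and keeps the running sums; the function name is ours and the terms are computed directly rather than through logarithms.

```python
import math

def cumulative_from_zero(n, q, upto):
    """Running sums of the terms for 0, 1, ..., upto failures in n events."""
    sums, running = [], 0.0
    for r in range(upto + 1):
        running += math.comb(n, r) * q ** r * (1 - q) ** (n - r)
        sums.append(running)
    return sums

sums = cumulative_from_zero(160, 1 / 16, 5)   # last six terms of (15/16 + 1/16)^160
half_p = sums[5]                               # 5 or fewer failures
for failures, value in enumerate(sums):
    print(failures, round(value, 5))
```

The running sum passes 0.05 between the terms for 4 and 5 failures, so that ½P for 5 or fewer failures is a little over 0.06.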

Example 28. The theoretical ratio gives a skew distribution, but the
numbers are small. Calculate the probability by the direct method as in
Example 27.

4. Exercises.

1. (a) Expand the binomials $(\tfrac{1}{2} + \tfrac{1}{2})^{8}$ and
$(\tfrac{3}{4} + \tfrac{1}{4})^{12}$, and calculate the value of each term.

(b) If there is an equal probability of the birth of male and female
rabbits, determine the probability in a litter of 8 of the occurrence of
2 females and 6 males.

(c) Plot the histogram for the expansion of
$(\tfrac{3}{4} + \tfrac{1}{4})^{12}$. A bag contains white and black
balls in the ratio of 3 white to 1 black. Show that, if a sample of 12
balls is taken at random, the probability of obtaining 12 white balls is
different from that of obtaining 6 or more black balls, although both
cases represent an equal deviation from the expected 9 white to 3 black.

(a) In order to check the work add all the terms and the sum should be
very close to 1.000.

(b) P = 0.1094. (Note that this is not a test of significance. It is
merely a question of determining the probability of the occurrence of one
particular ratio.)

(c) Illustrates the problem of making tests of significance in skew
distributions.

2. Koltzoff (3) performed an experiment on the control of sex in rabbits.
Sperms were placed in a physiological solution in a tube and an
electrical current passed through the tube. A female impregnated with
sperms taken from the anode produced 6 females and 0 males, and another
female impregnated with sperms from the cathode produced 1 female and 4
males. Test the significance of this result.

Using the direct method, P is 0.0152.

3. From a study of the position of the polar bodies in the ova of the
ferret, Mainland (4) gives the frequencies in the following table:

                              Similar     Different

10 μ apart                       5            1
More than 10 μ apart             1            6

Test the significance of the apparent association between similarity and
position of the polar bodies. P = 0.025, calculated by the direct method.

4. Neatby (6) studied the association, in a random sample of lines from a
wheat cross, of resistance to different physiologic forms of the stem
rust organism. Two tables from his results are given below. Test the
significance of the association in each case.

                  Form 21                              Form 21
                 SR      S                      [illegible]    SR

Form 27   R      28     41          Form 57   R      46        40
          SR     17     15                    S       0        16

R (resistant), SR (semi-resistant), S (susceptible).

$\chi'^2 = 0.93$; $\chi'^2 = 13.50$.

5. Twenty-two animals are suffering from the same disease, and the
severity of the disease is about the same in each case. In order to test
the therapeutic value of a serum it is administered to 10 of the animals
and 12 remain uninoculated as a control. The results are as follows:

                     Recovered     Died

Inoculated               7           3
Not inoculated           3           9

Determine the probability in such an experiment of obtaining this or a
result more favorable to the treatment. By the direct method ½P = 0.0456.

6. An experiment is conducted similar to that in Exercise 5 but no
uninoculated animals are available for a control. Previous results,
however, indicate very strongly that the proportion of recoveries to
deaths without treatment is 1 to 3. Again, the result is 7 recoveries to
3 deaths when 10 animals are treated. Test the significance of this
result, and explain why it differs from that obtained in Exercise 5.

½P = 0.0035.

In the problem of Exercise 5 the theoretical ratio is itself estimated
from the sample.
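
The answers given for the two serum exercises (22 animals with a control, and 10 animals without) may be checked by the sketch below. It assumes that the control row of the first exercise reads 3 recovered and 9 died, as the total of 12 uninoculated animals requires; the variable names are ours.

```python
import math

# With a control: the margins are fixed (10 inoculated, 12 not; 10 recovered, 12 died),
# and we sum the direct probabilities for 7, 8, 9 and 10 recoveries among the inoculated.
with_control = sum(
    math.comb(10, k) * math.comb(12, 10 - k) for k in range(7, 11)
) / math.comb(22, 10)

# Without a control: the recovery rate is taken as known, 1 recovery to 3 deaths,
# and we sum the binomial terms for 7 or more recoveries among 10 treated animals.
p = 0.25
without_control = sum(
    math.comb(10, k) * p ** k * (1 - p) ** (10 - k) for k in range(7, 11)
)

print(round(with_control, 4), round(without_control, 4))
```

The first probability is much the larger because the recovery rate without treatment is there estimated from only 12 control animals, whereas in the second case it is treated as known.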








CHAPTER XI 


THE ANALYSIS OF VARIANCE 

1. The Heterogeneity and Analysis of Variation. If we consider
the variation in such a character as stature in man, it is obvious that
this variation in general is not homogeneous. Two races may differ
decidedly in their average stature, and the individuals of each race will
vary around a common mean. Also, with reference to the variation
within each race, there are regional and genetic differences between
certain groups so that even within the race the variation is not strictly
homogeneous. In actual fact we can conclude with a reasonable degree
of certainty that variation cannot be strictly homogeneous unless it is
purely random, i.e., caused by a multiplicity of minor factors that cannot
be distinguished one from another. In experimental work the
heterogeneity of variation is usually predetermined by the plan of the
experiment. One set of results is obtained, for example, under a given
set of conditions and another under distinctly different conditions, the
object being to compare the two groups of results. Here the heterogeneity
of the variation is the factor that is being tested, and the degree of
its expression determines the significance of the findings of the
experiment. It would seem to be a necessity, therefore, in studies of
variation, to be able to differentiate the variation according to causes
or groups of causes, especially in experimental work where such
differentiation is an essential part of the analysis of the results. The
analysis of variance supplies the mechanism for this procedure and in
addition sets out the results in a form to which tests of significance
can be applied.

The points mentioned above may be made more obvious by the consideration
of a theoretical example. Suppose that, for two races of men that we
shall designate as A and B, the mean stature of race A is 66 inches and
that of race B is 68 inches. Histograms are prepared for the frequency
distributions of stature for the two races, and one histogram is
superimposed on the other. The two distributions will undoubtedly
overlap, but are very likely to show two distinct peaks at the means of
the two populations. The variation over all the individuals comprising
the two races could then be fairly definitely described as heterogeneous.
We might now endeavor to picture what the situation might be if we were
dealing with several races instead of only two. There might be a number
of peaks, perhaps as many peaks as there are races; but it is more likely
that some of the groups will so nearly coincide as to be
indistinguishable. Now that we have in mind several races, however, it is
probably easier to think in terms of the total variability of all the
individuals concerned being divided up into two portions. One portion is
that which occurs within all the races. To get a mental picture of this,
we might suppose the frequency distributions for all the races
superimposed on one another in such a way that the means of the different
races would coincide. The resulting distribution would be a sort of
average of all the separate distributions. The second portion of the
variability would be that resulting from the differences between the
means, and if we had a sufficient number of these means we could make up
another frequency distribution for them. For each type of distribution a
standard deviation or a variance could be calculated, and it becomes
clear at once that a comparison of two such statistics would be valuable
in coming to a conclusion as to the degree of heterogeneity. To make this
point still more obvious, let us imagine a series of samples being taken
from a homogeneous population. As we have already learned, these samples
will have different means, but these differences will result merely from
random sampling. They will be large or small according to the magnitude
of the variation in the population from which they are drawn. This is a
very important generalization and one which is fundamental to an
understanding of the analysis of variance. If the original population has
a very small variation, the means of the samples drawn from it will also
have a small variation. If the population has a large variation, it is to
be expected that this will be reflected in the variations of the means of
the samples. In fact, without going into the intricacies of an algebraic
proof it seems reasonable to assume that, on the average, the variance of
the means of the samples will be equal to that in the original
population, provided of course that we multiply this variance by the
number in the samples. Thus, if the variance of the population is v, the
variance of the sample means is expected to be v/n, where n is the number
of individual determinations entering into each mean.
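
The statement that the variance of sample means is expected to be v/n can be illustrated by a small sampling experiment. The sketch below is an illustration only and not a proof: the normal population, its standard deviation of 3, the sample size of 4, the number of samples and the seed are all arbitrary choices of ours.

```python
import random
import statistics

def variance_of_sample_means(pop_sd, n, samples, seed=0):
    """Draw `samples` samples of size n from a normal population; return the variance of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, pop_sd) for _ in range(n))
        for _ in range(samples)
    ]
    return statistics.variance(means)

v = 3.0 ** 2                                   # population variance
v_means = variance_of_sample_means(3.0, n=4, samples=20000)
print(round(v_means, 3), round(v / 4, 3))      # observed against expected v/n
print(round(4 * v_means, 3), v)                # multiplied by n, against v
```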

The next step in the development of these ideas is to consider what
the situation would be if, in taking a series of samples, we did not know
that they were being taken from a homogeneous population. The
variance of the population is unknown; hence it must be estimated
from the values in the samples. The most logical estimate is that arising
from the variations within each sample, from its own mean. Suppose that
this estimate is $v_1$ and the estimate of the variance of the sample
means is $v_2/n$. Multiplying the latter by n we have $v_2$, which we
shall expect to be very close to $v_1$ if the population is homogeneous,
but which may be very much larger than $v_1$ if the population is
heterogeneous and this heterogeneity has corresponded with the method of
taking the samples. This suggests to us that there may be a technique
here for making a test of significance. The null hypothesis is that all
the samples have been drawn from the same population, and therefore
that $v_2$ does not differ significantly from $v_1$. For example, if we
take the ratio $v_2/v_1$, a test of significance could be made if, for a
given example, we could determine the proportion of the trials in which a
value as large as or larger than $v_2/v_1$ would be obtained owing
entirely to random sampling fluctuations. We are indebted to Dr. R. A.
Fisher for many of the recent developments in statistical methods, but
especially for the solution of this particular problem. If there are only
two samples it will be noted that we have already discussed a solution,
in that we may apply the t test to the significance of the difference
between the means. However, if there are more than two samples the t test
does not apply, and we must use the technique of the analysis of variance
as developed by R. A. Fisher (3). The details of this technique are best
learned by the consideration of actual data.

2. Division of "Sums of Squares"¹ and Degrees of Freedom. As
pointed out in previous chapters the variance is a measure of variation,
and it consists of a sum of squares of deviations from the mean divided
by the corresponding degrees of freedom. In a set of observations, if
the total sum of squares of the deviations from the mean can be divided
up according to some scheme suggested by the data, and the degrees of
freedom can be divided correspondingly, it is clear that a variance can
be calculated for each group as well as for the total. It is through the
comparison of such variance values that we obtain a true picture of the
variation in the entire set of observations.

¹ "Sums of squares" written thus is an abbreviation for (sums of squares
of deviations from the mean), but in general throughout this book the
quotation marks are omitted.

With respect to the division of sums of squares, the best way to observe
this and to follow the method is to deal with actual data. The
figures given below are yields in bushels per acre of 6 plots of wheat.
Three of these plots are of variety A and three of variety B.

        A     27.6     32.4     23.4
        B     19.2     18.6     16.5

The total sum of squares is made up of the sum of the squares of the
deviations of the 6 plots from the general mean. A logical division of
this total is to separate it into one part due to variation within the
varieties, and another part due to variation between the varieties. Let
the general mean be $\bar{x}$, which in this case is 137.7/6 = 22.95. The
mean of A is $\bar{x}_a = 27.8$, and the mean of B is $\bar{x}_b = 18.1$.
Then subtracting 22.95 from each value, squaring and summating, we have:

2(* - i)* = 185.715 

where 2! indicates that 6 deviations are summated. Now, to obtain 
1 

the sum of squares for within the varieties, we must repeat the above 
operation for each variety and add the two sums of squares together. 
Thus for A we subtract 27.8 from each of the A values and square and 
summate. This gives: 


±(x - i,)* = 40.560 

1 

and for B we have 

S(® -*»)* = 4.020 

Then 22(a: — f,)* = 40.560 + 4.020 = 44.580, where the double 

summation indicates the process of adding together the two sums of 
squares, and f < represents the mean of one group. 

The next step is to calculate the sum of squares for between the 
varieties. This is given by 

$$3 \times [(27.8 - 22.95)^2 + (18.1 - 22.95)^2] = 141.135$$

Note that we obtain the deviations of the means of A and B from the
general mean and then square and summate, but we multiply the whole
sum by 3 because each value such as 27.8 represents the mean of 3 single
plots.

The formula for this sum of squares will be
$3 \sum_1^2 (\bar{x}_g - \bar{x})^2$.

Now if we add the sums of squares

    Within   +   Between   =   Total
    44.580   +   141.135   =   185.715

$$\sum_1^2 \sum_1^3 (x - \bar{x}_g)^2 + 3 \sum_1^2 (\bar{x}_g - \bar{x})^2 = \sum_1^6 (x - \bar{x})^2$$

we note that the within and between sums are exactly equal to the total. 
That the sums of squares can always be divided in this way is very
easily proved for the general case. A set of observations classified in
one direction may be represented as follows

$$
\text{Groups} \quad
\begin{array}{c|ccccc}
1 & x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\
2 & x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
3 & x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\
\vdots & & & & & \\
k & x_{k1} & x_{k2} & x_{k3} & \cdots & x_{kn}
\end{array}
$$

where there are k groups and n single observations in each group.

For any one observation, say $x_{11}$, we can write

$$(x_{11} - \bar{x}) = (x_{11} - \bar{x}_1) + (\bar{x}_1 - \bar{x})$$

where $\bar{x}_1$ is the mean of group 1. Then

$$(x_{11} - \bar{x})^2 = (x_{11} - \bar{x}_1)^2 + (\bar{x}_1 - \bar{x})^2 + 2(x_{11} - \bar{x}_1)(\bar{x}_1 - \bar{x})$$

And summating for all the values in group 1 we have

$$\sum_1^n (x - \bar{x})^2 = \sum_1^n (x - \bar{x}_1)^2 + n(\bar{x}_1 - \bar{x})^2 + 2(\bar{x}_1 - \bar{x}) \sum_1^n (x - \bar{x}_1)$$

The last term is zero because the sum of the deviations from the mean
must be zero and each deviation is multiplied by a constant factor.
The second last term is written $n(\bar{x}_1 - \bar{x})^2$ because the
factor $(\bar{x}_1 - \bar{x})^2$ is constant and we merely summate it n
times. Finally we have

$$\sum_1^n (x - \bar{x})^2 = \sum_1^n (x - \bar{x}_1)^2 + n(\bar{x}_1 - \bar{x})^2$$

Now we repeat this for each group, and summating over all the k
groups we have

$$\sum_1^{nk} (x - \bar{x})^2 = \sum_1^k \sum_1^n (x - \bar{x}_g)^2 + n \sum_1^k (\bar{x}_g - \bar{x})^2 \qquad (1)$$

which is exactly equivalent to the equation given above with the actual
sums of squares.

The division of degrees of freedom corresponding to the sums of
squares follows easily. In the example for two varieties we have a
total of 5 degrees of freedom, for within varieties we have 2 in each
group making a total of 4, and for between varieties we have only 1.
Thus

    Total   =   Within   +   Between
      5     =      4     +      1

In the general case as outlined above the degrees of freedom
corresponding to the sums of squares of equation (1) are

     Total     =    Within     +   Between
    (nk - 1)   =   k(n - 1)    +   (k - 1)
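
As a check on the arithmetic, the division of the sum of squares for the six wheat plots can be sketched in a few lines of Python. This is only an illustrative sketch; the variable and function names are our own choices and are not part of the method as described in the text.

```python
# Yields (bushels per acre) of the six wheat plots given in the text.
yields = {
    "A": [27.6, 32.4, 23.4],
    "B": [19.2, 18.6, 16.5],
}


def mean(values):
    return sum(values) / len(values)


def sum_sq_dev(values, centre):
    """Sum of squared deviations of `values` from `centre`."""
    return sum((v - centre) ** 2 for v in values)


all_plots = [v for group in yields.values() for v in group]
general_mean = mean(all_plots)

# Total, within-variety and between-variety sums of squares.
total_ss = sum_sq_dev(all_plots, general_mean)
within_ss = sum(sum_sq_dev(g, mean(g)) for g in yields.values())
between_ss = sum(len(g) * (mean(g) - general_mean) ** 2 for g in yields.values())

# Degrees of freedom: (nk - 1) = k(n - 1) + (k - 1).
k = len(yields)
n = len(yields["A"])
df_total, df_within, df_between = n * k - 1, k * (n - 1), k - 1

print(round(total_ss, 3), round(within_ss, 3), round(between_ss, 3))
print(df_total, df_within, df_between)
```

The three sums of squares agree with the figures 185.715, 44.580 and 141.135 worked out above, and the within and between parts add to the total both for the sums of squares and for the degrees of freedom.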


3. Setting up the Analysis of Variance. For the practical example
with two varieties we can now set up an analysis of variance as follows:

                            Sum of      Degrees of    Mean Square
    Source                  Squares     Freedom       or Variance

    Within varieties         44.580         4            11.14
    Between varieties       141.135         1           141.1
    Total                   185.715         5



As would be expected from the difference between the means of A and 
B, the variance for between varieties is very high as compared to that 
for within varieties. Reference to Chapter IV on tests of significance
with small samples will recall that the variance for within varieties is 
the variance which is converted into a standard error in order to test 
the significance of the difference between the means. This variance 
can be termed, therefore, the error variance and can be used as a measure 
of the significance of the variance for between varieties. 

4. Tests of Significance. In the typical analysis of variance we 
have an error variance with which we wish to compare one or more other 
variances. Strictly speaking, all these variances are estimates of the 
true value, and this is, of course, the reason why to obtain them we must 
divide the sums of squares by the degrees of freedom. In order to under- 
stand the test of significance it is necessary to consider in the first place 
the condition that would obtain on the average if the variance we are 
testing is subject to exactly the same source of variability as the error 
variance. Let the sum of squares for error be represented by $S_1$ and
the sum of squares for the variance to be tested by $S_2$. The
corresponding degrees of freedom are $n_1$ and $n_2$, and the estimates
of variance are:

$$v_1 = \frac{S_1}{n_1}, \qquad v_2 = \frac{S_2}{n_2}$$

and let $F = v_2 / v_1$.

Suppose that $v_2$ represents the variance for between the varieties
A and B as in the actual example above. If there is no real difference
between A and B, the differences between the means that occur will be
due to soil heterogeneity which is the sole cause contributing to the
error variance. On the average, therefore, $v_1 = v_2$, or $F = 1$. But if
the experiment, still assuming that A is not different from B, is repeated
a number of times, F will be subject to random fluctuations and will be
distributed in some regular manner. Thus in any one experiment if
F = 2.6 we could judge the significance of this value if we could
determine the exact percentage of cases in which an F of 2.6 would occur as
the result of random sampling fluctuations. The problem is therefore
one of determining the distribution of F and tabulating the results in such
a way that they can be used to determine probabilities. R. A. Fisher (3)
has worked out the distribution of F and in tests of significance
replaces it by $z = \tfrac{1}{2} \log_e F$. The distribution of z depends
entirely on the degrees of freedom $n_1$ and $n_2$, from which the
variances are estimated. Its use therefore does not involve any
assumptions regarding the population and is equally applicable for large
and small samples. Tables have been prepared giving the values of z at the
5% and the 1% points for different values of $n_1$ and $n_2$. In comparing
$v_1$ and $v_2$, if we find that z is equal to the value given at the 5%
point, this means that the observed F value would occur owing to random
sampling fluctuations in only 5% of the cases.

Snedecor (11) has calculated tables of F for the 5% and 1% points,
and this enables us to make a test of significance directly without looking
up logarithms. Table 96 is a copy of Snedecor's table of F.
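
The relation between the two kinds of table can be illustrated with a short Python sketch. The conversion $z = \tfrac{1}{2} \log_e F$ is taken from the text. The closed-form tail probability used below is not from the text: it is a standard property of the F distribution that holds only when $n_1 = 2$, and is used here merely to avoid a table lookup. For other values of $n_1$ a table or a statistical library would be needed. All function names are our own.

```python
import math


def z_from_f(f_ratio):
    """Fisher's z = (1/2) * log_e(F)."""
    return 0.5 * math.log(f_ratio)


def upper_tail_n1_2(f_ratio, n2):
    """Chance of an F at least this large when n1 = 2 (special case only)."""
    return (1.0 + 2.0 * f_ratio / n2) ** (-n2 / 2.0)


def critical_f_n1_2(alpha, n2):
    """F value exceeded with probability alpha when n1 = 2 (special case only)."""
    return (n2 / 2.0) * (alpha ** (-2.0 / n2) - 1.0)


# 5% points for n1 = 2 and n2 = 9, the case met again in Example 29.
f_5pct = critical_f_n1_2(0.05, 9)
z_5pct = z_from_f(f_5pct)
print(round(f_5pct, 2), round(z_5pct, 4))
```

For $n_1 = 2$ and $n_2 = 9$ the sketch gives an F of about 4.26 and a z of about 0.7242, the two tabular values quoted in Example 29, which shows that the F and z tables carry the same information.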

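5. Multiple Classification of Variates. In the simple example we
have considered, the variates were classified according to variety only.
They may, however, be classified in several ways, and it is only rarely
that they are not classified in two or three ways. We shall consider
twofold classifications first. The general case may be represented as
follows:

$$
\text{Groups} \quad
\begin{array}{c|ccccc}
 & \multicolumn{5}{c}{\text{Classes}} \\
 & 1 & 2 & 3 & \cdots & n \\
\hline
1 & x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\
2 & x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
3 & x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\
\vdots & & & & & \\
k & x_{k1} & x_{k2} & x_{k3} & \cdots & x_{kn}
\end{array}
$$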

in which the variates are in k groups and n classes. The essential
difference between this arrangement and that illustrated under Section 2
above is that the variates in any one class have something in common
in that they can be logically placed together and recognized as a definite 
unit. In field experiments the groups may be varieties and the classes 
blocks or replicates. In a chemical experiment the groups may repre- 
sent formulae and the classes different temperature or moisture condi- 
tions under which the formulae are tried. In medical or nutritional 
work the groups may be different foods and the classes different quanti- 
ties or times of feeding. 

The equations representing sums of squares and degrees of freedom
for the twofold classification are as follows:

    Total = Within Groups and Classes + Between Groups + Between Classes

Sums of squares:

$$\sum_1^{nk} (x - \bar{x})^2 = \sum_1^{nk} (x - \bar{x}_g - \bar{x}_c + \bar{x})^2 + n \sum_1^k (\bar{x}_g - \bar{x})^2 + k \sum_1^n (\bar{x}_c - \bar{x})^2 \qquad (2)$$

Degrees of freedom:

$$nk - 1 = (n - 1)(k - 1) + (k - 1) + (n - 1)$$

where $\bar{x}_g$ is the mean of a group and $\bar{x}_c$ is the mean of a class. Note that
in this case the sum of squares for within groups and classes is rather 
complex and in corresponding form to equation (1) should be written 
with a triple summation. The form used, however, is more convenient 
and expresses the idea successfully. It is customary in analyses of 
experiments to refer to the within sum of squares as that due to error 
as it gives rise to the variance with which the other estimates of variance 
can be compared. 

In order to picture a threefold classification, we can assume that in 
the previous example there are m classes and n subclasses. Graphically 
the arrangement will be: 

               1              2              3        ...        m
           1 2 ... n      1 2 ... n      1 2 ... n           1 2 ... n

      1
      2
      .
      .
      k


The analysis of data of this type introduces a new factor in the sums of
squares, in that we must consider the interactions of the three classes
with one another. This is best studied, however, from actual examples,
and the same applies to still more complex types of classifications.

6. Selecting a Valid Error. Significance is a relative and not an
absolute term. Differences are found to be significant or insignificant
in relation to the variability arising from a source which is arbitrarily
selected according to the interpretation that is to be put on the result.
To make these points clear let us assume that an experiment is being
conducted involving chemical determinations. Two kinds of material
are being tested; the method is to draw samples from each kind of
material, and in the laboratory each sample is being tested in duplicate.
Obviously here we have two sources of error. The first arises from
sampling the material, and the second from differences between the
results for duplicate determinations arising purely from errors in the
laboratory technique. These two sources of error are independent and
therefore may be of the same magnitude or widely different. If 20
samples are taken from each kind of material the analysis of variance
will be of the following form:

                              DF    Variance

    Materials (A and B)        1       m
    Between A samples         19       a
    Between B samples         19       b
    Between duplicates        40       d
    Total                     79

For the purpose of this discussion it can be assumed that the variances
a and b are of the same magnitude and can be considered together, say
as variance s. Now we wish to test the significance of the difference
between the two kinds of materials, and we will suppose that d is very
small in comparison to s. It is not difficult to see that the variance m
is contributed to by the variability in the samples, or in other words
that on the average if there is no difference between the two materials
the variance m will be equal to the variance s. Since d is very small
it is clear that to use it to test m is quite erroneous, as even when there
is no difference between the materials the ratio of m to d will be quite
large. What will the situation be, however, if d is much larger than s?
With a little thought it will be plain that this would be a very unlikely
situation as s is in itself contributed to by the factors that result in the
variance d. Putting it another way, if there is no variation whatever
due to sampling, s will on the average be equal to d. The question
therefore has no point, and we must consider the only other possibility,
and that is that d and s are of about equal magnitude. The inference,
then, is that s results largely from the differences between the duplicates,
and that the sampling error is in itself insignificant. The obvious
course here is to use d in order to test m, and at the same time we take
advantage of the greater precision due to d being represented by a larger
number of degrees of freedom than s.

Another hypothetical experiment may be considered in which the
situation is slightly different. Again two materials are being compared,
but it can be assumed that the material is sufficiently homogeneous that
the sampling error is negligible. There is a possibility of error in the
laboratory technique and also there is a possibility of personal error in
that no two operators can be expected to get exactly the same results.
In making out the plan of the experiment it is decided that six different
operators shall be used, all of whom perform exactly the same test on
the same two materials. Also each operator makes his determination
in triplicate in order that a measure may be obtained of the error in the
technique. The analysis of variance for the results will be as follows:

                                DF    Variance

    Materials                    1       m
    Operators                    5       o
    Error due to operators       5       e
    Error of determination      24       d
    Total                       35

The variance e now requires some consideration in order to note its
relation to the significance of the results. If we set up the mean results
for each operator in a table it will be of the following form:

                              Operators
                     1     2     3     4     5     6

                A   a1    a2    a3    a4    a5    a6
    Materials
                B   b1    b2    b3    b4    b5    b6

where $a_1$, for example, represents the mean of three determinations made
by operator 1 on material A.

Now the variance e results from differences between such values as
$(a_1 - b_1)$ and $(a_2 - b_2)$. There being 6 of these values, there are
5 degrees of freedom available for estimating the variance. If each
operator gets the same result for the difference between A and B, the
variance e will be zero; but if the operators get widely varying
differences the variance e will be very high. Suppose now that the experiment
is presumed to be a sample of a large population of operators making
similar determinations on these same two materials, then the variance
m, which represents the difference between the two materials, will be
contributed to by the factors that produce e; and hence, if there is no
difference between the materials, m will be equal to e. In sampling 
such a population, therefore, and testing the significance of the results, 
it will be necessary to use e as an error variance to test the significance 
of m. This fact may be more obvious if we consider the disastrous 
results of not using the variance e as a measure of error. The variance d 
may be quite low owing to extreme care in the standardization of the 
technique as applied to any one individual operator, and we shall assume 
that it is much lower than e. Using d as an error we find that, although 
m is very little greater than e, it is very significant if compared with d. 
The results are used therefore to prove that, for example, A gives a 
much larger result than B, and on this basis the two materials are util- 
ized in some industry for manufacturing purposes. The manufacturers, 
however, in utilizing the material may have to employ a large number of 
operators; and hence the error that was neglected in the laboratory 
creeps in and it turns out in actual practice that the two materials give 
the same result, and the so-called carefully controlled experiment of the 
laboratory is discredited. This mistake would have been avoided if the 
investigator had carefully considered the exact nature of the population
that was being sampled and made his test of significance accordingly. 
Of course it might happen that only one operator was used in the experi- 
ment, in which case the reader will recall the discussion of Chapter V on 
the scope of experiments and will realize that this would be another 
example of an experiment so planned that it did not have sufficient 
scope to answer the questions that it was supposed to answer. 
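
The way in which m and e both arise from the operator differences can be sketched numerically. The operator means below are invented purely for illustration and are not data from the text, and the variable names are our own. The sketch computes m and e from the differences $(a_i - b_i)$ and shows that the ratio of m to e is the square of the ordinary paired t value calculated from the same six differences.

```python
# Hypothetical operator means, each the mean of r = 3 determinations.
# These figures are invented for illustration; they are not from the text.
r = 3
a = [10.2, 10.8, 9.9, 10.5, 10.1, 10.7]  # material A, operators 1 to 6
b = [9.8, 10.9, 9.1, 10.6, 9.4, 10.0]    # material B, operators 1 to 6

k = len(a)
diffs = [ai - bi for ai, bi in zip(a, b)]
mean_diff = sum(diffs) / k

# Variance m (1 DF): sum of squares between the two materials.
ss_materials = r * k * mean_diff ** 2 / 2
# Variance e (k - 1 = 5 DF): variation of the differences between operators.
ss_interaction = r * sum((d - mean_diff) ** 2 for d in diffs) / 2

var_m = ss_materials / 1
var_e = ss_interaction / (k - 1)
f_ratio = var_m / var_e

# The same comparison made as a paired t test on the six differences.
s_d_sq = sum((d - mean_diff) ** 2 for d in diffs) / (k - 1)
t_squared = k * mean_diff ** 2 / s_d_sq

print(round(f_ratio, 4), round(t_squared, 4))
```

The two printed values coincide. Testing m against e therefore treats each operator's difference between A and B as a single observation, which is the sense in which the operators are the population being sampled.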

A point that may now be raised is this. If the error resulting from 
the determinations made by individual operators is not to be used to 
test the significance of the difference between the materials, what benefit 
is to be derived from making the determinations in triplicate and includ- 
ing the variance d in the analysis? The answer to this is that if there 
is an appreciable error in the determinations, the variance e will be con- 
tributed to by this source of variation, and hence, if there is no variation 
due to the operators, on the average e will be equal to d. The variance 
d, therefore, enables us to apply a test of significance to e; and, further- 
more, if d is appreciable, it reduces the precision of the experiment by 
making its contribution to e. In the latter case, improvement in the 
technique of the determination may result in a considerable improve- 
ment in the precision of the experiment. 

A variance such as e in the hypothetical example given above is 
usually referred to as an interaction variance. It gets this name because 
if it represents a fairly large effect it may be taken as an indication 
of an interaction between the two factors that are concerned. In
considering operators and materials, for example, we may conclude if e is
very large that the materials respond quite differently in the hands of
different operators. As a matter of fact, if we are willing to use more 
than one word to describe such an effect, it might be more appropri- 
ate to speak of an interaction as a differential response. Let us assume 
that, in general, material B gives a higher result in the determinations 
that are being made than material A. This may appear more reasonable
if we assume that A and B are not different in quality but
in quantity, in which case it is customary to refer to A and B as
representing two different levels of one of the interacting factors. The
more appropriate symbolism then would be to represent A and B by
such symbols as $X_1$ and $X_2$, the same letter indicating that there
are no qualitative differences between the two, and the subscripts
indicating that this factor is at two different levels. Now if $X_2$ gives a
higher value in the determinations than $X_1$, this is plainly a case of
response to quantity, and if there were several levels of X instead of only
two the result would recall the phenomena observed in the study of
regression. It is now easy to visualize what is meant by a differential
response. Some of the operators may be able to obtain the maximum
response whereas others may obtain a much smaller response. In 
certain instances it may easily turn out that with some operators the 
response will be positive and with other operators it will be negative. 
This type of effect would be likely to result in a very large interaction 
variance. 

The meaning of interactions will be discussed in further detail in the
consideration of actual examples. For the present it will suffice for the 
student to have a clear conception of the idea of differential responses, 
and to realize that frequently an interaction variance is in reality a true 
error variance and therefore must be used to test the significance of the 
results of the experiment. 

Example 29. Simple Classification of Variates. Table 29 gives the yields of
four plots each of three varieties of wheat. We shall use the analysis of variance to
determine the significance of the differences between the varieties.

TABLE 29

Yields of 4 Plots Each of 3 Varieties

                    Plot Yields               Totals

    A      29.2    36.4    22.4    27.6       115.6
    B      32.7    39.3    28.6    29.3       129.9
    C      18.7    23.1    21.3    19.6        82.7

                                   Total      328.2

The first step is to decide on the form of the analysis and to allocate the degrees of
freedom to each component according to the scheme decided upon. In this case we
are concerned merely with comparing the variety variance with a variance for error,
and the most logical error variance is one arising from within the varieties. The
form of the analysis is therefore

                                Sum of Squares    DF

    Between varieties                              2
    Within varieties (error)                       9

The second step is to calculate the sums of squares. The best plan is first to obtain
the total sum of squares. A formula has been given above, but this is not the best
formula for actual calculation. It is much better to make use of the identity

$$\sum_1^{nk} (x - \bar{x})^2 = \sum_1^{nk} (x^2) - \frac{T_x^2}{nk} \qquad (3)$$

where $T_x$ is the total of all the values of x, or $\sum_1^{nk} (x)$.

Therefore we merely square and summate the actual values and subtract from
this sum the square of the grand total divided by the number of variates. The
figures are

    Total sum of squares = 9452.50 - 8976.27 = 476.23


The calculation of the sum of squares for between varieties is carried out with the
assistance of a similar identity

$$n \sum_1^k (\bar{x}_g - \bar{x})^2 = \frac{\sum_1^k (T_g^2)}{n} - \frac{T_x^2}{nk} \qquad (4)$$

where $T_g$ represents the total for a variety. The formula consists therefore of
squaring and summating the totals, dividing by the number of variates entering into
each total, and then subtracting the same term as for the total sum of squares. The
figures are

    Between varieties = 9269.16 - 8976.27 = 292.89


To determine the sum of squares for within varieties we can perform a separate
calculation for each variety:

    Within A = 3441.12 - 115.6^2/4 = 100.28
           B = 4290.23 - 129.9^2/4 =  71.73
           C = 1721.15 -  82.7^2/4 =  11.33

    Total within                   = 183.34

Actually it was not necessary to calculate the last sum of squares as we could have
obtained it by subtracting the sum of squares for between varieties from the total.
Thus:

    Total    -   Between   =   Within
    476.23   -   292.89    =   183.34

However, when possible it furnishes a very easy check on the calculations to obtain
the error sum of squares directly and indirectly.

The third step is to set up the analysis of variance and make the tests of
significance. This is performed in Table 30.

TABLE 30

Analysis of Variance

                          Sum of     Degrees of                          z =
                          Squares    Freedom      Variance      F     1/2 log_e F

    Between varieties     292.89         2         146.4       7.19     0.9863
    Error                 183.34         9          20.37
    Total                 476.23        11

In Fisher's tables we look up the 5% point of z for $n_1 = 2$ and $n_2 = 9$. The value
is 0.7242, so that the variety differences here are quite significant. Using Snedecor's
tables of F (Table 96) we find that the 5% point for F is 4.26, and we of course reach
exactly the same conclusion.
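
The three steps of Example 29 can be retraced with a short Python sketch based on identities (3) and (4). It is offered only as a check on the hand calculation; the variable names are our own.

```python
import math

# Plot yields from Table 29.
table_29 = {
    "A": [29.2, 36.4, 22.4, 27.6],
    "B": [32.7, 39.3, 28.6, 29.3],
    "C": [18.7, 23.1, 21.3, 19.6],
}

values = [x for plots in table_29.values() for x in plots]
n = len(table_29["A"])  # plots per variety
k = len(table_29)       # number of varieties

# Correction term T_x^2 / nk, common to identities (3) and (4).
correction = sum(values) ** 2 / (n * k)

total_ss = sum(x ** 2 for x in values) - correction                        # (3)
between_ss = sum(sum(p) ** 2 for p in table_29.values()) / n - correction  # (4)

# Error sum of squares obtained directly and by subtraction.
within_direct = sum(
    sum(x ** 2 for x in p) - sum(p) ** 2 / n for p in table_29.values()
)
within_by_subtraction = total_ss - between_ss

var_between = between_ss / (k - 1)
var_error = within_direct / (k * (n - 1))
f_ratio = var_between / var_error
z_value = 0.5 * math.log(f_ratio)

print(round(total_ss, 2), round(between_ss, 2), round(within_direct, 2))
print(round(f_ratio, 2), round(z_value, 4))
```

The direct and indirect values of the error sum of squares agree, which is the check recommended above, and F and z come out at about 7.19 and 0.9863 as in Table 30.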

Example 30. Twofold Classification of Variates. In a swine-feeding experiment
Dunlop (2) obtained the results given in Table 31. The three rations, A, B, and C,
differed in the substances providing the vitamins. The animals were in 4 groups of
3 each, the grouping being on the basis of litter and initial weight. For our purpose
we shall assume that the grouping is merely a matter of replication.

TABLE 31

Gains in Weight of Swine Fed on Rations A, B, C

                     I       II      III      IV      Totals

              A     7.0     16.0    10.5     13.5      47.0
    Ration    B    14.0     15.5    15.0     21.0      65.5
              C     8.5     16.5     9.5     13.5      48.0

                   29.5     48.0    35.0     48.0     160.5

The form of the analysis is

               Sum of Squares    DF

    Rations                       2
    Groups                        3
    Error                         6
    Total                        11

Calculating the sums of squares we have

    Total   = 2316.75 - (160.5)^2/12 = 2316.75 - 2146.6875 = 170.0625
    Rations = (47.0^2 + 65.5^2 + 48.0^2)/4 - 2146.6875     =  54.1250
    Groups  = (29.5^2 + ... + 48.0^2)/3 - 2146.6875        =  87.7292
    Error   = remainder                                    =  28.2083

This gives us an analysis of variance as follows:

                Sum of      Degrees of
                Squares     Freedom      Variance      F      5% Point

    Rations     54.1250         2         27.06       5.76      5.14
    Groups      87.7292         3         29.24       6.22      4.76
    Error       28.2083         6          4.701
    Total      170.0625        11

The variance for rations is just significant. The meaning of the significance of 
the variance for groups depends on the manner in which the classification into groups 
has been made. We have assumed here that the groups are merely replications, in 
which case the error variance is a result of variations within groups not due to the 
rations. It is therefore valid to consider this variance as an error variance with 
which the others can be compared. The group variance, since it results from the 
plan of the experiment, is an expression of error control. If the arrangement had 
been other than in groups we would have had a simple classification into within and 
between rations. The variance for within rations would have been much larger than
it is according to the present arrangement, and consequently the experiment would
have been less precise.
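
The calculations of Example 30 can be retraced in the same way. The sketch below is illustrative only and uses names of our own choosing. It obtains the error sum of squares both as a remainder and directly from the within term of equation (2), so that the two can be compared.

```python
# Gains in weight from Table 31: one list per ration, one entry per group I to IV.
gains = {
    "A": [7.0, 16.0, 10.5, 13.5],
    "B": [14.0, 15.5, 15.0, 21.0],
    "C": [8.5, 16.5, 9.5, 13.5],
}
rations = list(gains)
n_groups = 4
n_rations = len(rations)

values = [x for row in gains.values() for x in row]
correction = sum(values) ** 2 / len(values)

total_ss = sum(x ** 2 for x in values) - correction
ration_ss = sum(sum(row) ** 2 for row in gains.values()) / n_groups - correction
group_totals = [sum(gains[r][j] for r in rations) for j in range(n_groups)]
group_ss = sum(t ** 2 for t in group_totals) / n_rations - correction
error_ss = total_ss - ration_ss - group_ss  # remainder

# The same error term computed directly, as in equation (2).
grand_mean = sum(values) / len(values)
ration_means = {r: sum(gains[r]) / n_groups for r in rations}
group_means = [t / n_rations for t in group_totals]
error_direct = sum(
    (gains[r][j] - ration_means[r] - group_means[j] + grand_mean) ** 2
    for r in rations
    for j in range(n_groups)
)

error_var = error_ss / ((n_rations - 1) * (n_groups - 1))
f_rations = (ration_ss / (n_rations - 1)) / error_var
f_groups = (group_ss / (n_groups - 1)) / error_var

print(round(ration_ss, 4), round(group_ss, 4), round(error_ss, 4))
print(round(f_rations, 2), round(f_groups, 2))
```

The remainder and the direct calculation give the same error sum of squares, and the F values of about 5.76 and 6.22 agree with those in the analysis of variance above.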

Example 31. Selecting a Valid Error. A series of 5 wheat varieties were grown
at 4 stations and baking tests made on the flour. A sample of each variety was taken
from each station and milled into flour. Two loaves were baked from each sample.
The error of determination was given, therefore, by the differences between the loaf
volumes of the duplicate loaves. These data were supplied by courtesy of the
Associate Committee on Grain Research of the National Research Council of Canada.

TABLE 32

Duplicate Loaf Volumes for 5 Varieties of Wheat Grown at 4 Stations
(Loaf volumes in cc. - 500)/10

                                        Stations
                       I             II            III            IV        Totals

               1    7.5   4.5    15.5  14.0    16.5  14.5    19.0  18.6     110.1
               2   12.5  13.2    20.0  18.5    15.0  14.0    23.8  24.4     141.4
    Varieties  3    7.0   1.0    10.0   8.0    15.5  14.0    17.8  18.5      91.8
               4    1.5   2.0    13.0  15.0     8.5   9.0    14.8  16.6      80.4
               5   28.0  29.0    19.5  16.0    10.5  12.0    22.0  24.8     161.8

    Totals           106.2         149.5         129.5         200.3        585.5

On examining the form that the analysis of variance will take, we note first that
we must have a station variance represented by 3 degrees of freedom, and a variety
variance represented by 4 degrees of freedom. There must also be an interaction
effect which may be regarded as the differential response of the varieties at the
different stations. The rule for finding the degrees of freedom for an interaction is to
multiply the degrees of freedom for the interacting factors. The interaction variance
must therefore be represented by 3 X 4 = 12 degrees of freedom. There is a total
of 40 determinations, so that there is a total of 39 degrees of freedom. The remaining
20 degrees of freedom must represent the error of duplicate determinations, and we
have a check on this because there are 20 pairs of loaves and since each pair gives us 1
degree of freedom there must be 20 in all. The final form of the analysis is:


    Variance        DF

    Stations         3
    Varieties        4
    Interaction     12
    Error           20
    Total           39


To obtain the sums of squares another table as given below is required. This
table gives the values of (x - y) and (x + y), where x and y are taken to represent
the paired values.

                      (x - y)

              I       II      III     IV

    1        3.0     1.5     2.0     0.4
    2        0.7     1.5     1.0     0.6
    3        6.0     2.0     1.5     0.7
    4        0.5     2.0     0.5     1.8
    5        1.0     3.5     1.5     2.8

                      (x + y)

              I       II      III     IV      Totals

    1       12.0    29.5    31.0    37.6      110.1
    2       25.7    38.5    29.0    48.2      141.4
    3        8.0    18.0    29.5    36.3       91.8
    4        3.5    28.0    17.5    31.4       80.4
    5       57.0    35.5    22.5    46.8      161.8

           106.2   149.5   129.5   200.3      585.5

The first half of this table may be used for calculating the error sum of squares. A
general rule for the sum of squares for differences within paired values is to use the
identity

$$\text{Total minus between pairs} = \tfrac{1}{2} \sum (x - y)^2$$

The two expressions on the left are $\sum (x^2) - T_x^2/N$ and
$\sum (x + y)^2/2 - T_x^2/N$. On subtracting and simplifying we obtain
$\tfrac{1}{2} \sum (x - y)^2$. The calculations give

    Within pairs (error) = 1/2 (93.33) = 46.66

From the second half of the calculation table we determine

    Between pairs = 20566.13/2 - 585.5^2/40
                  = 10283.065 - 8570.256       = 1712.81
    Stations      = 90519.03/10 - 8570.256     =  481.65
    Varieties     = 73186.61/8 - 8570.256      =  578.07
    Interaction   = remainder                  =  653.09


This procedure gives us a general rule for the calculation of interaction sums of
squares. In the table considered we find the total and subtract the sum of squares
for the two interacting factors. The remainder is the interaction.

The analysis of variance is as follows:

                    Sum of
                    Squares      DF     Variance

    Stations         481.65       3      160.5
    Varieties        578.07       4      144.5
    Interaction      653.09      12       54.42
    Error             46.66      20        2.333
    Total           1759.47      39

We now have to decide whether we should use the variance from the duplicate
loaf volumes or the interaction variance to test the significance of the differences
between stations and varieties. If the purpose of the experiment is to determine
which of the varieties will give the highest loaf volume over the whole area that the
stations sample, it will be necessary to use the interaction variance because in this
light the stations are merely replications of the experiment. The error from
duplicate loaf volumes will give an indication merely of the accuracy of the laboratory
technique. If it is large it will reduce the significance of the differences, because it
raises the value of the interaction variance.

On comparing the variety variance with the interaction variance we get an F
value of 2.66; and since the 5% point is 3.26, we must conclude that, considering the
whole area being sampled, the differences in loaf volume are not significant. In other
words, the variation in the order of the mean loaf volumes of the varieties, from
station to station, is so great that the differences between the means for the whole
area may easily be accounted for by this variation.

The interaction variance is very much higher than that arising from differences 
between duplicate loaf volumes. This means that the laboratory error is not an 
appreciable factor affecting the precision of the results in this experiment. 

Since variety tests are conducted in replicated plots at each station, it follows
that if loaf volume determinations had been made on each plot another measure of
error could have been obtained. This error would have measured the variation due
to soil heterogeneity; and, if the variety variance for the whole area was significant
when compared to the pooled error due to soil heterogeneity, this would indicate that
in general at each station the differences between the means of the varieties were
greater than could be accounted for by such sampling variation. This would not,
however, alter our conclusion based on the test using the interaction as an error.
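
The whole of Example 31 can be retraced with the following Python sketch, which works from the coded duplicate loaf volumes of Table 32. It is a check on the arithmetic under our own choice of variable names, not a prescribed procedure. It also confirms the identity used for the error sum of squares, namely that the total minus the sum of squares between pairs equals $\tfrac{1}{2} \sum (x - y)^2$.

```python
# Coded duplicate loaf volumes from Table 32.
# loaves[variety][station] = (first loaf, second loaf)
loaves = [
    [(7.5, 4.5), (15.5, 14.0), (16.5, 14.5), (19.0, 18.6)],
    [(12.5, 13.2), (20.0, 18.5), (15.0, 14.0), (23.8, 24.4)],
    [(7.0, 1.0), (10.0, 8.0), (15.5, 14.0), (17.8, 18.5)],
    [(1.5, 2.0), (13.0, 15.0), (8.5, 9.0), (14.8, 16.6)],
    [(28.0, 29.0), (19.5, 16.0), (10.5, 12.0), (22.0, 24.8)],
]
n_var, n_sta = len(loaves), len(loaves[0])
n_total = 2 * n_var * n_sta

pair_sums = [[x + y for x, y in row] for row in loaves]
grand_total = sum(sum(row) for row in pair_sums)
correction = grand_total ** 2 / n_total

total_ss = sum(x * x + y * y for row in loaves for x, y in row) - correction
between_pairs = sum(s * s for row in pair_sums for s in row) / 2 - correction
error_ss = sum((x - y) ** 2 for row in loaves for x, y in row) / 2

station_totals = [sum(row[j] for row in pair_sums) for j in range(n_sta)]
variety_totals = [sum(row) for row in pair_sums]
station_ss = sum(t * t for t in station_totals) / (2 * n_var) - correction
variety_ss = sum(t * t for t in variety_totals) / (2 * n_sta) - correction
interaction_ss = between_pairs - station_ss - variety_ss  # remainder

# Varieties tested against the interaction, as argued in the text.
interaction_df = (n_var - 1) * (n_sta - 1)
f_varieties = (variety_ss / (n_var - 1)) / (interaction_ss / interaction_df)

print(round(station_ss, 2), round(variety_ss, 2), round(interaction_ss, 2))
print(round(error_ss, 2), round(f_varieties, 2))
```

The F value for varieties against the interaction comes out at about 2.66, below the 5% point of 3.26 quoted in the text. Had the error of duplicate determinations been used instead, the ratio would have been far larger, which illustrates how much the conclusion depends on selecting a valid error.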

Example 32. Threefold Classification of Variates. In testing out a machine
for molding the dough in experimental baking, Geddes, et al. (5), used 3 adjustments
of the machine, designated A, B, and C, and tried them out on a series of 5 flours
baked according to 2 formulae. The loaf volume data are given in Table 33.

TABLE 33

Loaf Volume Results in a Test of a Machine for Molding the Dough
(Loaf volume in cc. - 500)/10

                Machine                      Flours
    Formula     Setting          1       2       3       4       5      Totals

                   A            9.4     2.6    12.3     4.6    13.5      42.4
    Simple         B            9.6     3.1    13.0     4.3    13.8      43.8
                   C            9.6     2.7    12.4     1.8    13.0      39.5

            Flour subtotals    28.6     8.4    37.7    10.7    40.3     125.7

                   A           13.7    21.6    19.4    13.5    24.5      92.7
    Bromate        B           12.7    22.6    20.6    10.4    24.3      90.6
                   C           12.6    21.8    20.9     6.8    23.2      85.3

            Flour subtotals    39.0    66.0    60.9    30.7    72.0     268.6

            Flour totals       67.6    74.4    98.6    41.4   112.3     394.3


On working out the form of the analysis we find that there is an additional com- 
plication here as compared to those that have been worked out pievioiiKly The 
6 rows m Table 33 represent 2 classifications, but for the pre.seiit we shall consider 
them as 6 classes giving us a simple twofold classification The form of the nnalysis 
is then: 


Flours 

4 DF 

Classes 

5 DF 

Interaction (a) 

20 DF 

Total 

29 DF 


But the 5 degrees of freedom for classes must be split up into:

Machine settings (ABC)      2 DF
Formulae (SB)               1 DF
Interaction (ABC × SB)      2 DF

Hence interaction (a) in the first analysis is an interaction of the above three factors
with the flours. Realising this, we can then write out the form of the analysis in full:

Flours (1 ... 5)                      4 DF
Machine settings (ABC)                2 DF
Formulae (SB)                         1 DF
Interaction (ABC × SB)                2 DF
     "      (1 ... 5 × ABC)           8 DF
     "      (1 ... 5 × SB)            4 DF
     "      (1 ... 5 × ABC × SB)      8 DF
Total                                29 DF

The last interaction is known as a triple interaction. In this case it represents the
degree to which the interaction of (ABC × SB) is different for the different flours.
If the interaction (ABC × SB) is the same for each flour, the triple interaction will
be zero.
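The degrees of freedom in the full form of the analysis follow one rule: a factor with m levels carries m - 1 degrees of freedom, and an interaction carries the product of the degrees of freedom of the factors entering into it. A minimal sketch in Python of this bookkeeping is given here as a check only; the function name and factor labels are our own and are not part of the original method.

```python
from itertools import combinations
from math import prod


def df_partition(levels):
    """Return {term: degrees of freedom} for a complete factorial layout
    with a single determination per cell.

    `levels` maps each factor name to its number of levels.
    """
    names = list(levels)
    table = {}
    for size in range(1, len(names) + 1):
        for term in combinations(names, size):
            # Product of (levels - 1) over the factors in this term.
            table[" x ".join(term)] = prod(levels[f] - 1 for f in term)
    return table


# Example 32: 5 flours, 3 machine settings, 2 formulae.
example_32 = df_partition({"Flours": 5, "Settings": 3, "Formulae": 2})
for term, df in example_32.items():
    print(f"{term:32s}{df:3d}")
print(f"{'Total':32s}{sum(example_32.values()):3d}")
```

The separate terms add up to one less than the number of determinations, which is the check used in the text.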

To determine the sums of squares for the components set out above it is necessary
to set up 3 calculation tables as below:

Machine                        Flours                       Totals
Settings         1       2       3       4       5

A               23.1    24.2    31.7    18.1    38.0        135.1
B               22.3    25.7    33.6    14.7    38.1        134.4
C               22.2    24.5    33.3     8.6    36.2        124.8
Totals          67.6    74.4    98.6    41.4   112.3        394.3


Formulae                       Flours                       Totals
                 1       2       3       4       5

S               28.6     8.4    37.7    10.7    40.3        125.7
B               39.0    66.0    60.9    30.7    72.0        268.6
S + B           67.6    74.4    98.6    41.4   112.3        394.3
B - S           10.4    57.6    23.2    20.0    31.7        142.9

Formulae              Machine Settings              Totals
                 A          B          C

S               42.4       43.8       39.5          125.7
B               92.7       90.6       85.3          268.6
S + B          135.1      134.4      124.8          394.3
B - S           50.3       46.8       45.8          142.9

The calculations are:¹

Total                            6618.43 - 394.3²/30 = 6618.43 - 5182.42 = 1436.01
Flours (1 ... 5)                 34,152.33/6 - 5182.42 = 509.64
Settings (ABC)                   51,890.41/10 - 5182.42 = 6.62
Formulae (SB)                    (268.6 - 125.7)²/30 = 680.68
Interaction (ABC × SB)           Σ(B - S)²/10 - 680.68 = 6817.97/10 - 680.68 = 1.12

Interaction (1 ... 5) × (ABC)
    Total for table              11,436.57/2 - 5182.42 = 535.86
    Flours (1 ... 5)             509.64
    Settings (ABC)               6.62
    Remainder (1 ... 5) × (ABC)  19.60

Interaction (1 ... 5) × (SB)     Σ(B - S)²/6 - 680.68 = 5369.05/6 - 680.68 = 214.16
Interaction (1 ... 5 × ABC × SB) remainder = 4.19

The analysis of variance when set up in detail is as follows:

                                    Sums of    DF    Variance      F        5%
                                    Squares                                Point

Flours (1 ... 5)                     509.64     4     127.4       243.1    3.84
Formulae (SB)                        680.68     1     680.7      1299.0    5.32
Interaction (1 ... 5 × SB)           214.16     4      53.54      102.2    3.84

Settings (ABC)                         6.62     2       3.31        6.31   4.46
Interaction (ABC × SB)                 1.12     2       0.560       1.07   4.46
     "      (1 ... 5 × ABC)           19.60     8       2.450       4.68   3.44
     "      (1 ... 5 × ABC × SB)       4.19     8       0.524

Total                               1436.01    29
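The arithmetic of Example 32 can be checked by recomputing every sum of squares directly from the loaf volumes of Table 33. What follows is a minimal sketch in Python under our own naming (the dictionary layout and the helper `ss_of_totals` are illustrative assumptions, not part of the original text). Each sum of squares is built as in the calculation tables: squared totals divided by the number of variates in each total, less the correction term, with interactions obtained by subtraction.

```python
# Coded loaf volumes, (cc - 500)/10, from Table 33: data[formula][setting]
# lists flours 1 to 5. The flour-4 entries for Simple A (4.6) and Bromate B
# (10.4) are the values implied by the marginal totals of the table.
data = {
    "Simple": {
        "A": [9.4, 2.6, 12.3, 4.6, 13.5],
        "B": [9.6, 3.1, 13.0, 4.3, 13.8],
        "C": [9.6, 2.7, 12.4, 1.8, 13.0],
    },
    "Bromate": {
        "A": [13.7, 21.6, 19.4, 13.5, 24.5],
        "B": [12.7, 22.6, 20.6, 10.4, 24.3],
        "C": [12.6, 21.8, 20.9, 6.8, 23.2],
    },
}
formulae = list(data)
settings = ["A", "B", "C"]
flours = range(5)

values = [data[f][s][j] for f in formulae for s in settings for j in flours]
correction = sum(values) ** 2 / len(values)  # square of grand total over N


def ss_of_totals(totals, variates_per_total):
    """Sum of squared totals over the variates in each, less the correction."""
    return sum(t * t for t in totals) / variates_per_total - correction


ss = {"Total": sum(v * v for v in values) - correction}

# Main effects, each from its marginal totals.
ss["Flours"] = ss_of_totals(
    [sum(data[f][s][j] for f in formulae for s in settings) for j in flours], 6)
ss["Settings"] = ss_of_totals(
    [sum(data[f][s][j] for f in formulae for j in flours) for s in settings], 10)
ss["Formulae"] = ss_of_totals(
    [sum(data[f][s][j] for s in settings for j in flours) for f in formulae], 15)

# Two-factor interactions: two-way table of totals less the two main effects.
ss["Flours x Settings"] = ss_of_totals(
    [sum(data[f][s][j] for f in formulae) for s in settings for j in flours], 2
) - ss["Flours"] - ss["Settings"]
ss["Flours x Formulae"] = ss_of_totals(
    [sum(data[f][s][j] for s in settings) for f in formulae for j in flours], 3
) - ss["Flours"] - ss["Formulae"]
ss["Settings x Formulae"] = ss_of_totals(
    [sum(data[f][s]) for f in formulae for s in settings], 5
) - ss["Settings"] - ss["Formulae"]

# The triple interaction is whatever remains of the total.
ss["Flours x Settings x Formulae"] = ss["Total"] - sum(
    v for name, v in ss.items() if name != "Total")

for name, value in ss.items():
    print(f"{name:30s}{value:10.2f}")
```

Dividing each sum of squares by its degrees of freedom, and each resulting variance by that of the triple interaction, gives the F values of the analysis; small differences in the last figure arise only from rounding.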





It is of interest to make a detailed study of Example 32 from the
standpoint of the selection of a valid error. We note first that the
determinations were not made in duplicate so that we have no real
measure of the error in the technique; and, if such an error is the one
that should be used throughout for tests of significance, we shall have
to select one of the other variances that gives us a close approximation
of what the error of duplicate loaf volumes would be. In the second
place it must be remembered that the primary object of the experiment
is to study the differences in the loaf volumes due to the different settings
of the machine and the differential responses due to these same settings.
For this reason the analysis of variance has been separated into two
portions. The three effects in the first group are of no particular
interest, as previous experience would have enabled the cereal chemists
to predict that just such results would be obtained. The separation
of these three effects into one group was not a result of the data obtained in
the experiment, but was preconceived, and it was decided before the
experiment was operated that this would be done.

¹ Note the method used to calculate interactions for a series of paired values.
This will be explained in more detail in the next example.

Considering the variance due to the settings, the first question to be 
asked is whether or not it should be tested against a variance representing
purely laboratory error or against the interaction of the settings with 
the flours. The answer follows from the fact that we are concerned 
not so much with the interaction of the settings with the flours as with 
attempting to find out the best single setting of the machine for all 
purposes; and therefore we do not anticipate that, in differentiating a 
set of flours, all the settings that have been tried here will be used. 
Actually our measure of significance in this experiment must be based 
on the usual experimental error of the laboratory, because, if the machine 
settings cause differences significantly greater than those resulting from 
experimental error, it is obvious that before the machine is used for 
general purposes the most desirable setting must be worked out. In
other words we ought to see to it that the machine does not introduce a 
greater error into the determinations than already exists as the result of 
the ordinary procedures of the laboratory.

On this basis it follows that the triple interaction is the most logical 
error to use, as it is the least likely to represent a significant effect and 
is not likely to be lower than the error due to differences between dupli- 
cate loaf volumes. The latter statement is the same as saying that, if 
there is no actual triple interaction effect, the variance will be equal to 
the error that would have resulted from using duplicate determinations. 

The F values with their 5% points are given in the analysis, and with 
their aid the results may be summarized very quickly. The flour and 
formula differences as well as the interaction between them are very 
large in comparison to the experimental error and may be dismissed 
with that statement. The primary interest in the experiment is in the 
settings of the machine and the interaction of the settings with the other 
factors. The settings are significant in relation to experimental error, 
and glancing at the totals we note that this must be due to the fact that 
the C setting gives a somewhat lower loaf volume than A or B. The 
interaction of ABC with the formulae (SB) is not significant, indicating
that the differences between the settings are reasonably consistent for
both methods of baking. The interaction of the flours with the settings
is significant, and we can conclude that the results with the flours are
to a certain extent changed by the machine settings. From an inspection
of the results this would seem to be due to flour 4, as for this flour
the B and C settings depress the loaf volume to a greater extent than
for the others.

7. Summary of Methods of Calculating Sums of Squares. After
the form of analysis has been worked out, the greatest difficulty that
confronts the student of the methods of this chapter is the calculation
of the sums of squares. Most of the methods have been dealt with in
the above examples, but it would seem to be desirable to summarize
them under one heading.

(a) Total for a set of n single variates, $x_1, x_2, \ldots, x_n$.

$$\sum (x - \bar{x})^2 = \sum (x^2) - \frac{T^2}{n}$$

We square each value and summate, then subtract the square of the
total divided by the number of variates.

(b) For a set of k groups when each group is made up of n variates.
If there are k groups we can represent the totals for the groups as
$T_1, T_2, \ldots, T_k$ and the means for the groups by
$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$.

$$n \sum (\bar{x}_i - \bar{x})^2 = \frac{\sum (T_i^2)}{n} - \frac{T^2}{kn}$$

We square each total, summate, and then divide by the number of
variates entering into each total. From this we subtract the square of
the grand total divided by the number of variates.

(c) For a set of k groups when the number of variates is not the
same for each group. If we represent a particular series with the
corresponding number of variates in each group as follows:

Group totals    $T_1$, $T_2$, $T_3$, $T_4$
Numbers         a, b, c, d

We calculate:

$$\frac{T_1^2}{a} + \frac{T_2^2}{b} + \frac{T_3^2}{c} + \frac{T_4^2}{d} - \frac{T^2}{a + b + c + d}$$

In this case we square each total and divide by the number entering into
it. The quotients are summated, and from this sum we subtract the
square of the grand total divided by the total number of variates.
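Rules (a) to (c) translate directly into a few lines of code. The sketch below, in Python, is offered only as an illustration; the function names and the small set of values are hypothetical. It also shows the identity on which the analysis of variance rests: the total sum of squares equals the sum of squares between groups plus that within groups.

```python
def total_ss(values):
    """Rule (a): sum of x squared, less the squared total over n."""
    t = sum(values)
    return sum(x * x for x in values) - t * t / len(values)


def between_groups_ss(groups):
    """Rules (b) and (c): each squared group total over its own number of
    variates, summed, less the squared grand total over all variates."""
    grand = sum(sum(g) for g in groups)
    n_all = sum(len(g) for g in groups)
    return sum(sum(g) ** 2 / len(g) for g in groups) - grand ** 2 / n_all


# Hypothetical groups of unequal size, for illustration only.
groups = [[4.0, 6.0], [5.0, 7.0, 9.0], [10.0]]
pooled = [x for g in groups for x in g]

between = between_groups_ss(groups)
within = sum(total_ss(g) for g in groups)
print(round(total_ss(pooled), 4), round(between, 4), round(within, 4))
```

When every group has the same number of variates, `between_groups_ss` reduces to rule (b), since each squared total is then divided by the same n.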

(d) For within and between pairs. If a set of paired values are
represented as follows:

           1      2      3      4    ...    n       Totals

x         x_1    x_2    x_3    x_4   ...   x_n       T_x
y         y_1    y_2    y_3    y_4   ...   y_n       T_y
                                                     T

The sum of squares for between pairs is:

$$\frac{\sum (x + y)^2}{2} - \frac{T^2}{2n}$$

And for within pairs it is:

$$\tfrac{1}{2} \sum (x - y)^2$$

If each x and y value represents k variates we have:

$$\text{Between} = \frac{\sum (x + y)^2}{2k} - \frac{T^2}{2kn}$$

$$\text{Within} = \frac{\sum (x - y)^2}{2k}$$

(e) For two groups only. The totals for the groups may be $T_x$ and
$T_y$ as above in (d). The sum of squares is:

$$\frac{(T_x - T_y)^2}{N}$$

where N is the total number of variates.

(f) Simple interaction in a 2 × n table. The table is as in (d), in
which each value of x and y represents k variates. The interaction
(1, 2, 3, ... n) × (xy) is given by:

$$\frac{\sum (x - y)^2}{2k} - \frac{(T_x - T_y)^2}{2kn}$$
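The rules for paired values can be tried on the flour subtotals of Table 33, where each Simple and Bromate subtotal is a total of k = 3 loaves. The sketch below, in Python, is a check on rules (d) to (f) only; the function `paired_ss` and its dictionary keys are our own naming and not part of the original text.

```python
def paired_ss(x, y, k=1):
    """Sums of squares for n pairs (x_i, y_i), each value a total of k variates."""
    n = len(x)
    tx, ty = sum(x), sum(y)
    between = (sum((a + b) ** 2 for a, b in zip(x, y)) / (2 * k)
               - (tx + ty) ** 2 / (2 * k * n))          # rule (d), between pairs
    within = sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * k)  # rule (d), within
    two_groups = (tx - ty) ** 2 / (2 * k * n)           # rule (e), N = 2kn
    return {
        "between pairs": between,
        "within pairs": within,
        "two groups": two_groups,
        "interaction": within - two_groups,             # rule (f)
    }


# Flour subtotals from Table 33 (flours 1 to 5), each a total of 3 loaves.
simple = [28.6, 8.4, 37.7, 10.7, 40.3]
bromate = [39.0, 66.0, 60.9, 30.7, 72.0]

result = paired_ss(simple, bromate, k=3)
for name, value in result.items():
    print(f"{name:15s}{value:9.2f}")
```

Here the "between pairs" figure is the sum of squares for flours, "two groups" is that for formulae, and "interaction" is the flours × formulae interaction of Example 32.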

(g) Simple interaction for a 2 × 2 table. The following is a 2 × 2
table in which each value of x is a total for k variates.

           A       B

I         x_1     x_2
II        x_4     x_3

The interaction (AB × I II) is given by

$$\frac{[(x_1 + x_3) - (x_2 + x_4)]^2}{4k}$$

(h) Simple interaction for a k × n fold table. A table of this type is
illustrated in Section 5 above, and equation (2) shows how the sums of
squares and degrees of freedom are broken up. The sum of squares
for within groups and classes is the same as for the interaction and can
be calculated by subtracting the two terms on the right from the total.
The procedure therefore is as follows:

Total           = $\sum (x^2) - T^2/kn$
For n classes   = $\sum (T_c^2)/k - T^2/kn$
For k groups    = $\sum (T_g^2)/n - T^2/kn$
Interaction     = Difference

(i) Triple interaction. In more complex analyses it is sometimes
necessary to calculate triple interactions. We shall illustrate the method
for the simple case of 2 × 2 tables:¹

               X                     Y                     Z
           A       B             A       B             A       B

I         x_1     x_2           x_1     x_2           x_1     x_2
II        x_4     x_3           x_4     x_3           x_4     x_3

The interaction to be calculated is (XYZ × I II × AB). Assume each
value to be made up of k variates; then for each of the above tables we
have:

For X   (I II × AB) = $(x_1 + x_3 - x_2 - x_4)^2/4k$
    Y   (I II × AB) = $(x_1 + x_3 - x_2 - x_4)^2/4k$
    Z   (I II × AB) = $(x_1 + x_3 - x_2 - x_4)^2/4k$

Summating these gives us the sum of the interactions of (I II × AB),
taking each X, Y, and Z group separately. Next we find (I II × AB)
for X, Y, and Z combined, having set up another 2 × 2 table.

¹ If the three factors have only two levels the triple interaction is also represented
by only one degree of freedom and may therefore be calculated from a difference
between two correctly chosen totals. The method of building up these totals will
be clear after a study of the methods of the following chapter.

             X + Y + Z
           A       B

I         x_1     x_2
II        x_4     x_3

For X, Y, and Z, (I II × AB) = $(x_1 + x_3 - x_2 - x_4)^2/12k$, which,
when subtracted from the sum obtained for the three tables above,
gives the triple interaction (XYZ × I II × AB).

According to the same principle, triple interactions may be calculated
for any three factors. Note that there are three different ways in which
the calculations may be carried out, as repeated calculations of any one
of the three simple interactions will finally give the triple interaction.
Always examine the three possible methods and decide which one will
require the least amount of labor.
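The procedure of (i) may be sketched in a few lines of Python. The three 2 × 2 tables used here are hypothetical figures chosen only to make the arithmetic easy to follow, and the function names are our own; each table is written as [[x_1, x_2], [x_4, x_3]], that is, row I then row II.

```python
def interaction_2x2(table, k):
    """Rule (g): (x1 + x3 - x2 - x4)^2 / 4k for a table [[x1, x2], [x4, x3]]."""
    (x1, x2), (x4, x3) = table
    return (x1 + x3 - x2 - x4) ** 2 / (4 * k)


def triple_interaction(tables, k):
    """Sum of the separate 2 x 2 interactions less that of the combined table."""
    separate = sum(interaction_2x2(t, k) for t in tables)
    combined = [[sum(t[r][c] for t in tables) for c in range(2)]
                for r in range(2)]
    # Each cell of the combined table is a total of k * len(tables) variates.
    pooled = interaction_2x2(combined, k * len(tables))
    return separate - pooled


# Hypothetical tables X, Y, Z with k = 1 variate per cell.
X = [[5.0, 3.0], [2.0, 6.0]]
Y = [[4.0, 4.0], [3.0, 5.0]]
Z = [[6.0, 2.0], [1.0, 3.0]]

print(round(triple_interaction([X, Y, Z], k=1), 4))
```

For these figures the separate interactions are 9, 1, and 9, the combined table gives 196/12, and the difference is the triple interaction with 2 degrees of freedom.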

8. Exercises.

1. Table 34 taken from data by Crampton and Hopkins (1) gives the gains in
weight of pigs in a comparative feeding trial. The 5 lots of pigs represent 5 different
treatments, and there were 10 pigs in each lot. Make an analysis of variance for
the data, and test the significance of the treatment differences.

TABLE 34

Gains of Pigs in a Comparative Feeding Trial

Replicate    Lot I    Lot II    Lot III    Lot IV    Lot V

    1         165      168       164        185        ?
    2         156      180       156        195       189
    3         159      180       189        186       173
    4         167      166       138        201       193
    5         170      170       153        165       164
    6         146      161       190        175       160
    7         130      171       160        187        ?
    8         151      169       172        177       142
    9         164      179       142        166       184
   10         158      191       155        165       149

The error variance in this experiment works out to
2. In a study of hog prices in Iowa, Schultz and Black (9) have given prices by
months, years, and districts. The districts are obtained by dividing the state into 4.
A portion of the data is given in Table 35. After completing the analysis of variance
for these data, devise graphical means of illustrating the interaction of months with
years. It is not necessary in this exercise to make tests of significance of the results,
as it is being used here merely to show how the technique of the analysis of variance
can be used to separate out the various effects in a set of data.

iSum tjf sguares for months X years >■ 8S^4iS, 
3. In agronomic trials of varieties of cereal crops it is desirable to conduct the
trials at various points in the area under consideration and to carry them on for a
period of 2 or more years. Immer, et al. (8), have given data on barley yields at
several stations in Minnesota over a period of 2 years. Table 36 gives the yields at 3
of the stations for 2 years for 6 varieties. Analyze the results.

Note that the blocks are numbered 1, 2, and 3, but this does not mean that block
1 at University Farm has any relation to block 1 at Waseca or any other station.
Consequently the sum of squares and degrees of freedom for blocks are worked out at
each station and lumped together in the final analysis. A common error that
beginners make in sorting out the degrees of freedom for an experiment of this kind
is to regard the blocks as a factor occurring at three levels and thus they have such
expressions in their analysis as these:

Blocks × Stations
   "   × Years
   "   × Stations × Years
etc.

These expressions obviously have no meaning as the block numbers do not represent
definite levels that are uniform at all stations. The correct procedure is therefore
to calculate the block sum of squares for each experiment and add all these sums of
squares together in order to show them in the final analysis.

The following values for the sums of squares will assist in checking the calculations.


ToUd ll,604jei 

Varieties IMOSB 

Varieties X Stations X Years tSOM 


TABLE 35

Hog Prices Paid to Producers in Iowa, 1928-29 to 1930-31

             1928-29 Districts           1929-30 Districts           1930-31 Districts
Month       A      B      C      D      A      B      C      D      A      B      C      D

Oct.        ?     9.46    ?     9.66   8.79   8.98    ?     9.15   8.84   8.83    ?     8.86
Nov.       8.41   8.13    ?     8.44   8.32    ?     8.34   8.53   8.04   8.23   8.17   8.45
Dec.       7.91   7.85    ?     7.96   8.58    ?     8.44   8.54   7.39   7.31   7.32   7.34
Jan.       8.14   8.25    ?     8.24   8.79   8.69   8.71    ?      ?      ?     7.11   7.17
Feb.       9.14   9.03    ?      ?      ?     9.59   9.63   9.82   6.44   6.62   6.63   6.65
Mar.      10.57    ?    10.44    ?      ?     9.81   9.78  10.10    ?     6.87   6.84   6.88
Apr.      10.06    ?    10.56    ?      ?     9.22   9.26   9.42   6.78   6.86   6.92   6.92
May       10.36    ?      ?    10.14    ?      ?      ?     9.26    ?      ?     6.06    ?
June       9.95   9.86   9.97    ?      ?     9.17   9.14   9.33    ?     5.39   5.57    ?
July      10.64    ?    10.70    ?      ?     8.11   8.31    ?      ?     5.86   6.16   6.24
Aug.      10.35  10.34  10.34    ?      ?     8.52   8.68   8.75   5.91   5.66   6.24   6.36
Sept.      9.37   9.46   9.40   9.51   9.52   9.52   9.64   9.73    ?      ?     5.26   6.88
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TABLE 36 

Yields in Bushels per Acre of 6 Varieties of Barley Grown at 3 Stations
in Each of 2 Years

Station            Year    Block    Manchuria   Glabron   Svansota   Velvet   Trebi   Peatland
                            No.

University Farm    1931      1        29.2       44.6      33.9      36.7     41.2     38.5
                             2        26.0       39.1      39.4      41.0     31.9     29.6
                             3        26.8       45.5      32.1      42.0     36.6     30.2

University Farm    1932      1        19.7       28.6      20.1      20.3     19.3     22.3
                             2        31.4       38.3      30.8      27.5     22.4     30.8
                             3        29.6       43.5      31.4      32.6     45.5     31.1

Waseca             1931      1        47.5       55.4      44.5      56.9     63.9     41.2
                             2        52.2       53.4      46.0      40.6     63.8     51.5
                             3        46.9       56.8      51.5      53.2     63.6     53.0

Waseca             1932      1        40.8       44.4      41.0      44.6     53.5     39.8
                             2        29.4       34.9      41.1      41.4     44.2     39.2
                             3        30.2       33.9      33.4      26.2     50.0     29.1

Morris             1931      1        24.0       27.5      26.5      27.2     42.1     24.7
                             2        24.7       25.5      21.5      28.0     42.5     29.5
                             3        33.6       33.3      29.3      23.2     46.7     35.4

Morris             1932      1        29.6       36.6      27.1      35.9     40.0     35.7
                             2        34.1       34.3      35.7      33.9     46.9     41.9
                             3        39.4       34.5      12.3      46.7     53.0     52.0




4. Find the 5% points of F for the following values of $n_1$ and $n_2$.

  n_1     n_2

    3      51
    6      43
    4      92
   12     195
    7      36
   11      64
   16      39
   18     215
   17      19
   36      28
   28     154
   53      42

5. Prove: (1) That $\sum (x^2) - T^2/n = \sum (x - \bar{x})^2$.

(2) That the interaction for a 2 × 2 table is given by
$(x_1 + x_3 - x_2 - x_4)^2/4k$. See Section 7(g).

(3) That the sum of squares for the two subtotals $T_x$ and $T_y$ is given
by $(T_x - T_y)^2/N$. See Section 7(e).

(4) That in a series of pairs the sum of squares for within pairs is
given by $\tfrac{1}{2} \sum (x - y)^2$. See Section 7(d).



CHAPTER XII 


THE FIELD PLOT TEST 

GENERAL PRINCIPLES AND STANDARD DESIGNS 

1. Soil Heterogeneity. The fact of soil heterogeneity as it affects 
the yields of crops has been commented on by various writers. In the 
agronomic test it is the chief source of error in comparing varieties, soil 
and fertilizer treatments, and factors of a similar type. If soil heterogeneity
was practically non-existent a single pair of plots would be sufficient
to make a comparison of two varieties, but even then it is doubtful
whether that condition would be highly desirable. By a sufficient
expenditure we might render a piece of soil completely homogeneous, 
but by doing so we would partly defeat the purpose of the test which 
is to determine the behavior of varieties and treatments under a limited 
range of conditions. We would have selected one particular soil type 
for our experiment and therefore restricted the area to which our results 
would apply. The ideal agronomic test is one conducted on a piece of
land in which the range in soil type, etc., is the same as that in the
district to which the results are to be applied. Usually agronomic tests are
on soil that is much less subject to variation than the surrounding dis- 
trict so that in general the results from them are considered as applicable 
over too wide an area. This is not to argue that more variable soils 
should be selected, for that might again defeat the purpose of the test 
by rendering the results insignificant, but rather to point out the limita- 
tions of the tests as ordinarily conducted and that the ideal cannot be 
reached by any method of increasing the uniformity of the soil. 

2. Replication. In order to obtain greater accuracy in field experiments,
the most effective method is to increase the number of replications.
Increasing the plot size is also effective, but increasing replication
is much more so. In previous pages it has been pointed out that the
standard error of a mean is given by $s/\sqrt{n}$, where s is the standard error
of a single determination and n is the number of determinations averaged.
It follows, therefore, that, in replicating field plots, the decrease in the
standard error of the mean of one variety or treatment is proportional
to the square root of the number of replications. This rule applies only if
the variation due to the replicates themselves is removed from the error,
but, as will be pointed out below, this follows naturally from the plan of
the test and the use of the analysis of variance.
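The square-root rule is easily tabulated. The short Python sketch below uses a hypothetical standard error of 10 units for a single plot; the function name is our own.

```python
from math import sqrt


def se_of_mean(s, n):
    """Standard error of a mean of n determinations, each with standard error s."""
    return s / sqrt(n)


# Hypothetical single-plot standard error of 10 units.
for n in (1, 2, 4, 8, 16):
    print(f"{n:3d} replicates: standard error of mean = {se_of_mean(10.0, n):.2f}")
```

Four replicates are needed to halve the standard error of a single plot, and sixteen to reduce it to one quarter, so that each further gain in accuracy costs more plots than the last.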
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A most important consideration in the use of replications is that they 
furnish an estimate of the error of the experiment, and this estimate can
be obtained in no other way. The error of the experiment arises from
the differences between plots of the same variety or treatment that are
not due to the average differences between the replicates. From this it
is clear that, if there is only one complete set of plots of all the varieties
or treatments, there is no possibility of obtaining a measure of random
soil variability that can be used as an error in tests of significance. In
terms of the theory which has been emphasized repeatedly in the previous
pages, the variance of the variety or treatment means is subject to testing
on the hypothesis that it has arisen purely from random variations
in the fertility of the field. Since the only way in which we can form a 
reliable estimate of these random variations is to replicate the experi- 
ment, it follows that without replication there is positively no method 
of making a test of the significance of the variety or treatment differ- 
ences. 

3. Randomization. As pointed out above, the estimate of error is
taken from differences between plots that are treated alike. R. A.
Fisher states that "an estimate of error so derived will only be valid
for its purpose if we make sure that in the plot arrangement, pairs of
plots treated alike are not nearer together, or further apart than, or in
any other relevant way, distinguished from pairs of plots treated
differently." This point is obvious if we consider a simple replicated
experiment containing, say, 4 varieties, that we shall designate as A, B, C, and
D. Suppose, merely for purposes of argument, that the plots are square
and the arrangement of the plots in the field is as follows:

Replicate 1    A  B  C  D
Replicate 2    A  B  C  D
Replicate 3    A  B  C  D
Replicate 4    A  B  C  D

The form of the analysis will be:

                 DF    Variance

Replicates        3       r
Varieties         3       v
Error             9       e
Total            15

and now, if there are no variety differences it can be expected that on
the average the variance v will be equal to the error e, and unless our
experiment is designed to make this true it is unbalanced, or in the
usual terminology it is subject to a bias. On this basis it is possible to
picture the situation with respect to bias in this simple experiment, on 
varying the location of the plots with respect to distances between plots 
of the same variety and plots with different varieties. In the first place, 
suppose that the replicates are only 1 foot apart so that there is for ex- 
ample only a space of 1 foot between the plot of A in the first replicate 
and the plot of A in the second replicate. Then between the plots of 
different varieties there are 6-foot buffer plots of some other crop. This 
situation presents a very obvious bias in that the plots of different 
varieties are farther apart than plots of the same variety. The result is 
that, if there are no differences between the varieties, the variance v will 
on the average be larger than e. This very proposition was recognized 
by agronomists at an early stage in the development of field plot tests, 
and as a remedy for it suggestions were made as to the distribution of the 
plots in a systematic manner over the whole field. These suggestions, 
however, did not take into consideration the possibility of a bias in the 
opposite direction to that of the design outlined above. That such a 
bias is a distinct possibility has been shown by Tedin (10), in an exten- 
sive study of data from uniformity trials. A bias in the direction that 
tends to make the error too large, and the variety or treatment variance 
too small, is in effect just as disastrous as the opposite type of bias, as 
it means that, on the average, certain significant effects will be over- 
looked. 

A systematic type of distribution of the plots might be as follows: 

A B C D 

C D A B 

A B C D 

C D A B 

and it will be noted that the plots of the same variety are scattered 
widely over the field. This is the type of arrangement that is likely to 
result in an error that is too large, but, disregarding that point, there 
is another type of bias common to all systematic arrangements. This
may be referred to as an intravarietal bias, in that comparisons between
different pairs of varieties are not of equal precision. For example, in
both of the systematic arrangements that we have outlined above, the
varieties A and B occur on adjacent plots in every replication while
the varieties A and D are on the average farther apart. This is a very
undesirable feature of such experiments, for if a single error is used for 
the whole experiment it means that real differences between the varieties 
that are close together may be overlooked and other differences that 
actually do not exist may be judged significant. 
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From the above discussion it may appear to the reader that the 
field plot test is extremely complicated and difficult to set up in such a 
way that there is no bias. Actually, all these difficulties may be very
easily overcome by the simple process of arranging the varieties at ran- 
dom in each replication. Thus, instead of either of the arrangements 
that have been outlined, we would make up one as follows, in which the 
positions of the varieties are determined entirely at random. 

D C A B 

C B A D 

B C D A 

A D B C 

Then, regardless of the size or shape of the plots, it can be proved either 
mathematically or by actual trial that, in a series of such tests, using a 
different random arrangement each time, the variance v will on the average
be equal to the variance e. Details of the methods used for
randomization are given in Chapter XVI.
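A random arrangement of this kind can be drawn in many ways. The sketch below, in Python, uses the standard library's pseudo-random generator in place of the methods of Chapter XVI, which are not reproduced here; the function name and the `seed` argument are our own assumptions, the seed being given only so that a layout can be reproduced.

```python
import random


def randomized_layout(varieties, replicates, seed=None):
    """Return one independent random ordering of the varieties per replicate."""
    rng = random.Random(seed)
    # sample(..., len(varieties)) yields a random permutation of the list.
    return [rng.sample(varieties, len(varieties)) for _ in range(replicates)]


layout = randomized_layout(["A", "B", "C", "D"], replicates=4, seed=1)
for number, row in enumerate(layout, start=1):
    print(f"Replicate {number}: {' '.join(row)}")
```

Each replicate is shuffled separately, so that every variety still occurs exactly once in every replicate while its position within the replicate is left to chance.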

4. Error Control. In replicated experiments, the differences between
the plots of any one treatment are due in part to experimental error and
in part to the average differences between the replicates. The latter is
not relevant to the comparisons we wish to make, as each treatment is
represented by one plot in each replicate or block. The variance due
to blocks is therefore removed from the error, and, the larger the proportion
of the total variability that is removed, the more accurate the experiment.
This has a very important bearing on the plan of an experiment,
especially in relation to the shape of the blocks and of the plots. The
differences between long narrow plots, when they are placed side by side, are
usually less than those between square plots, and similarly for blocks,
and since we want the differences between plots as small as possible and
the differences between blocks as large as possible, the ideal plan is one
which combines long narrow plots with square blocks. Practical
considerations limit the shape of the plots, however, and consequently limit
also the shape of the blocks; but, if we keep this fundamental principle
in mind in drawing up experiments, the greatest possible efficiency will
be obtained.

The arrangements for error control by means of replication differ 
according to the plan of the experiment. There are two fundamental 
plans, randomized blocks, and the Latin square. Others that will be 
described later may be referred to as special types in that they are to a 
certain extent modifications of the fundamental types, and especially 
adapted to certain purposes.

5. Randomized Blocks. This plan is the simplest of all the types in
which any measure of error control is obtained. It is illustrated in the
following diagram, which represents an experiment with 8 treatments
in 4 blocks.



In the general case let k represent the number of blocks and n the number
of treatments. Then the equation for sums of squares is:

$$\overset{(1)}{\sum (x - \bar{x})^2} = \overset{(2)}{n \sum (\bar{x}_b - \bar{x})^2} + \overset{(3)}{k \sum (\bar{x}_t - \bar{x})^2} + \overset{(4)}{\sum (d^2)} \qquad (1)$$

where $\bar{x}_b$ is the mean of a block and $\bar{x}_t$ is the mean of a treatment. The
last term on the right is actually $\sum (x - \bar{x}_b - \bar{x}_t + \bar{x})^2$, but is
abbreviated for convenience. The corresponding equation for degrees of
freedom is:

$$\overset{(1)}{nk - 1} = \overset{(2)}{(k - 1)} + \overset{(3)}{(n - 1)} + \overset{(4)}{(n - 1)(k - 1)} \qquad (2)$$

In calculating the sums of squares the following formulae are the most
convenient.

(1) Total        $\sum (x - \bar{x})^2 = \sum (x^2) - T^2/nk$                  T = grand total for all plots
(2) Blocks       $n \sum (\bar{x}_b - \bar{x})^2 = \sum (T_b^2)/n - T^2/nk$    T_b = total for one block
(3) Treatments   $k \sum (\bar{x}_t - \bar{x})^2 = \sum (T_t^2)/k - T^2/nk$    T_t = total for one treatment
(4) Error        $\sum (d^2) = (1) - (2) - (3)$                                Subtract blocks and treatments from total.

The analysis of variance is set up in the usual way.
The standard error of the experiment is given by

$$s = \sqrt{\frac{\sum (d^2)}{(k - 1)(n - 1)}} \qquad (3)$$

and for the mean of one treatment

$$s_m = \frac{s}{\sqrt{k}} \qquad (4)$$


6. The Latin Square. The following diagram illustrates a 5 X 5 
Latin square where the letters represent the treatments. 


E   B   C   D   A
A   C   D   E   B
D   E   B   A   C
C   D   A   B   E
B   A   E   C   D


Note that the plots are arranged in 5 rows and 5 columns, and that there 
must be the same number of treatments as rows and columns. The 
treatments are placed at random, subject to the restriction that a treat- 
ment can occur only once in any row or column. 
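The restriction on a Latin square is easily tested by machine, and a valid square can be produced by shuffling a simple cyclic one. The Python sketch below is an illustration under our own naming; it is not the randomization procedure of the original text, and shuffling the rows, columns, and letters of a cyclic square does not draw with equal chance from every possible Latin square of a given size.

```python
import random


def random_latin_square(treatments, seed=None):
    """Shuffle the rows, columns, and letters of a cyclic Latin square."""
    rng = random.Random(seed)
    n = len(treatments)
    letters = rng.sample(treatments, n)   # random assignment of letters
    rows = rng.sample(range(n), n)        # random row order
    cols = rng.sample(range(n), n)        # random column order
    return [[letters[(r + c) % n] for c in cols] for r in rows]


def is_latin(square):
    """True if every treatment occurs exactly once in each row and column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok


# The 5 x 5 square of the diagram above, read row by row.
book_square = ["EBCDA", "ACDEB", "DEBAC", "CDABE", "BAECD"]
print(is_latin(book_square))

for row in random_latin_square(list("ABCDE"), seed=3):
    print(" ".join(row))
```

The check on rows and columns is the whole of the restriction; within it the arrangement is left to chance.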

Let n represent the number of rows, columns, and treatments, and
the equations for the sums of squares and degrees of freedom are as
follows:

$$\sum (x - \bar{x})^2 = n \sum (\bar{x}_r - \bar{x})^2 + n \sum (\bar{x}_c - \bar{x})^2 + n \sum (\bar{x}_t - \bar{x})^2 + \sum (d^2) \qquad (5)$$

where $\bar{x}_r$ and $\bar{x}_c$ represent the means of rows and columns respectively.

$$(n^2 - 1) = (n - 1) + (n - 1) + (n - 1) + (n - 2)(n - 1) \qquad (6)$$

The calculations for sums of squares are:

(1) Total        $\sum (x - \bar{x})^2 = \sum (x^2) - T^2/n^2$                   T = grand total of all plots
(2) Rows         $n \sum (\bar{x}_r - \bar{x})^2 = \sum (T_r^2)/n - T^2/n^2$     T_r = total for one row
(3) Columns      $n \sum (\bar{x}_c - \bar{x})^2 = \sum (T_c^2)/n - T^2/n^2$     T_c = total for one column
(4) Treatments   $n \sum (\bar{x}_t - \bar{x})^2 = \sum (T_t^2)/n - T^2/n^2$     T_t = total for one treatment
(5) Error        $\sum (d^2) = (1) - (2) - (3) - (4)$                            Subtract rows, columns, and treatments from the total.

The standard error in a Latin square is given by

$$s = \sqrt{\frac{\sum (d^2)}{(n - 2)(n - 1)}} \qquad (7)$$

And for the mean of one treatment

$$s_m = \frac{s}{\sqrt{n}} \qquad (8)$$

The Latin square gives error control in two directions across the field,
so that soil gradients are always taken care of. For a few treatments it
is a very efficient type of experiment, and it is very doubtful that a
better one can be devised. When the number of treatments is more
than 8 the Latin square is cumbersome and a point is soon reached
where the increase in accuracy does not warrant the added labor.
Moreover, as the number of treatments is increased the rows and
columns become longer in proportion to their width and a point is reached
finally where further accuracy through error control is not obtained.

Example S3. Randomized Blocks. Table 37 gives Ihe yields of 6 wheat varieties 
obtained in an experiment consisting of 4 randomized blocks. The marginal totals 
are given in the table so as to facilitate calculation 

TABLE 37 

Yields in Bushels per Acre by Blocks 
OF 6 Wheat Varieties 


Blocks Variety 

1 2 3 - 4 Totals 


A 

27 8 

27.3 

28.5 

38 5 

122 1 

B 

30 6 

28 8 

31 0 

39 5 

129 9 

C 

27 7 

22 7 

34 9 

36 8 

122 1 

Varieties D 

16 2 

15 0 

14 1 

19 6 

64.9 

E 

16.2 

17 0 

17 7 

15 4 

66 3 

F 

24.9 

22 6 

22 7 

26 3 

96 4 

Block Totala. 

143.4 

133.3 

148.9 

176.1 

601 7 
Calculating the sums of squares we have:

Total        $\sum(x^2) - T^2/nk = 16{,}460.05 - 15{,}085.12 = 1374.93$

Blocks       $\sum(T_b^2)/n - T^2/nk = 15{,}252.48 - 15{,}085.12 = 167.36$

Varieties    $\sum(T_v^2)/k - T^2/nk = 16{,}147.87 - 15{,}085.12 = 1062.75$

Error        $1374.93 - 167.36 - 1062.75 = 144.82$


The analysis of variance is then as follows:

               Sum of                                      5% Point
               Squares      DF     Variance       F         of F

Blocks          167.36       3       55.79       5.78        3.29
Varieties      1062.75       5      212.55      22.0         2.90
Error           144.82      15       9.655

Total          1374.93      23

The block and variety differences are seen to be significant, and if we wish to compare
any two varieties we make use of the standard error.

$$s = \sqrt{9.655} = 3.107, \qquad s_m = \frac{3.107}{\sqrt{4}} = 1.554$$

The standard error of a difference between the means of any 2 varieties is then
$1.554 \times \sqrt{2} = 2.20$. Now suppose that we wished to compare varieties D and F
for which the means are 16.2 and 24.1 respectively. The difference is 7.9 and we have

$$t = \frac{7.9}{2.20} = 3.59$$

From Table 94 we note that for 15 degrees of freedom t = 2.95 at the 1% point,
so that the difference between the 2 varieties is very significant. We take t for
15 degrees of freedom corresponding to the number of degrees of freedom available
for estimating the error variance. Unless the degrees of freedom are decidedly
limited a short cut can be employed for testing significance. From Table 94 we
note that t at the 5% point is approximately 2. Therefore a significant difference
will be $2 \times \sqrt{2} \times s_m = 2.82\, s_m$. Roughly a significant difference is $3\, s_m$.
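The arithmetic of Example 33 can be retraced with a short Python sketch. It is only an illustration of the formulas above, not part of the original text: the function name `randomized_block_anova` is hypothetical, and the yield of variety F in block 2 is taken as 22.5, the value that makes both marginal totals of Table 37 agree.

```python
# Table 37 yields (bushels per acre): one list per variety, ordered by block 1-4.
# Assumption: variety F, block 2 is read as 22.5 so that the marginal totals agree.
YIELDS = {
    "A": [27.8, 27.3, 28.5, 38.5],
    "B": [30.6, 28.8, 31.0, 39.5],
    "C": [27.7, 22.7, 34.9, 36.8],
    "D": [16.2, 15.0, 14.1, 19.6],
    "E": [16.2, 17.0, 17.7, 15.4],
    "F": [24.9, 22.5, 22.7, 26.3],
}


def randomized_block_anova(yields):
    """Return {source: (sum of squares, DF)} for a randomized block layout."""
    rows = list(yields.values())
    n = len(rows)       # number of varieties
    k = len(rows[0])    # number of blocks
    grand_total = sum(sum(row) for row in rows)
    correction = grand_total ** 2 / (n * k)           # T^2 / nk

    total_ss = sum(x * x for row in rows for x in row) - correction
    block_totals = [sum(row[j] for row in rows) for j in range(k)]
    block_ss = sum(t * t for t in block_totals) / n - correction
    variety_ss = sum(sum(row) ** 2 for row in rows) / k - correction
    error_ss = total_ss - block_ss - variety_ss       # obtained by subtraction

    return {
        "total": (total_ss, n * k - 1),
        "blocks": (block_ss, k - 1),
        "varieties": (variety_ss, n - 1),
        "error": (error_ss, (n - 1) * (k - 1)),
    }


anova = randomized_block_anova(YIELDS)
error_variance = anova["error"][0] / anova["error"][1]
for source, (ss, df) in anova.items():
    print(f"{source:10s} SS = {ss:9.2f}  DF = {df}")
print(f"error variance = {error_variance:.3f}")
```

The error sum of squares is found by subtraction, exactly as in the hand calculation, so any slip in a marginal total shows up at once as a disagreement with the figures in the text.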

Example 34. The Latin Square. The following is a plan of a Latin square 
which was used to test the efficiency of different methods of dusting with sulphur 
in order to control stem rust of wheat. The key to the treatments is given with the 
plan. 

              Columns                    Key to Treatments

Rows     1    2    3    4    5           A = Dusted before rains.
                                         B = Dusted after rains.
I        B    D    E    A    C           C = Dusted once each week.
II       C    A    B    E    D           D = Drifting once each week.
III      D    C    A    B    E           E = Check (undusted).
IV       E    B    C    D    A
V        A    E    D    C    B
All applications were 30 pounds to the acre at each treatment. Drifting means
that the dust was allowed to settle down over the plants from above. In the ordinary
procedure the sulphur is forced down among the plants by a blast of air.

The plot yields in bushels per acre are given in Table 38. The figures in the table 
correspond with the position of the plots in the above plan. 

TABLE 38

Plot Yields in Bushels per Acre

                          Columns
Rows            1       2       3       4       5     Totals

I              4.9     6.4     3.3     9.5    11.8     35.9
II             9.3     4.0     6.2     5.1     5.4     30.0
III            7.6    15.4     6.5     6.0     4.6     40.1
IV             5.3     7.6    13.2     8.6     4.9     39.6
V              9.3     6.3    11.8    15.9     7.6     50.9

Column Totals 36.4    39.7    41.0    45.1    34.3    196.5

Treatment Totals

A    34.2
B    32.3
C    65.6
D    39.8
E    24.6

In order to obtain the treatment totals we must select the yields according to
the position of the treatments in the plan. Thus for treatment B we have 4.9 + 7.6
+ 6.2 + 6.0 + 7.6 = 32.3. Finally we have all the treatment totals as given in
Table 38.

The calculations are as given below:


(1) Total        $\sum(x^2) - T^2/n^2 = 1829.83 - 1544.49 = 285.34$

(2) Rows         $\sum(T_r^2)/n - T^2/n^2 = 1591.16 - 1544.49 = 46.67$

(3) Columns      $\sum(T_c^2)/n - T^2/n^2 = 1558.51 - 1544.49 = 14.02$

(4) Treatments   $\sum(T_t^2)/n - T^2/n^2 = 1741.10 - 1544.49 = 196.61$

(5) Error        $(1) - (2) - (3) - (4) = 28.04$

Then the analysis of variance is:

               Sum of                                      5% Point
               Squares      DF     Variance       F         of F

Rows             46.67       4       11.67       4.99        3.26
Columns          14.02       4        3.50       1.50        3.26
Treatments      196.61       4       49.15      21.0         3.26
Error            28.04      12        2.34

Total           285.34      24
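The Latin square calculation of Example 34 can be checked in the same way. The sketch below is illustrative only: `latin_square_anova` is a hypothetical helper, and the plan and plot yields are entered as read from the example, on the assumption that the values are those which reproduce the printed row, column, and treatment totals (4.9 in row I, column 1; 5.3 in row IV, column 1; treatment E in row V, column 2).

```python
# Plan of the Latin square (rows I-V, columns 1-5) and matching plot yields.
# Assumption: cell values are those that reproduce the printed marginal totals.
PLAN = ["BDEAC", "CABED", "DCABE", "EBCDA", "AEDCB"]
YIELDS = [
    [4.9, 6.4, 3.3, 9.5, 11.8],
    [9.3, 4.0, 6.2, 5.1, 5.4],
    [7.6, 15.4, 6.5, 6.0, 4.6],
    [5.3, 7.6, 13.2, 8.6, 4.9],
    [9.3, 6.3, 11.8, 15.9, 7.6],
]


def latin_square_anova(plan, yields):
    """Return (sums of squares, treatment totals) for an n x n Latin square."""
    n = len(plan)
    grand_total = sum(sum(row) for row in yields)
    correction = grand_total ** 2 / n ** 2            # T^2 / n^2

    row_totals = [sum(row) for row in yields]
    col_totals = [sum(row[j] for row in yields) for j in range(n)]
    treatment_totals = {}
    for plan_row, yield_row in zip(plan, yields):
        for treatment, x in zip(plan_row, yield_row):
            treatment_totals[treatment] = treatment_totals.get(treatment, 0.0) + x

    def ss_from_totals(totals):
        return sum(t * t for t in totals) / n - correction

    ss = {
        "total": sum(x * x for row in yields for x in row) - correction,
        "rows": ss_from_totals(row_totals),
        "columns": ss_from_totals(col_totals),
        "treatments": ss_from_totals(treatment_totals.values()),
    }
    ss["error"] = ss["total"] - ss["rows"] - ss["columns"] - ss["treatments"]
    return ss, treatment_totals


ss, treatment_totals = latin_square_anova(PLAN, YIELDS)
n = len(PLAN)
error_variance = ss["error"] / ((n - 1) * (n - 2))
for source, value in ss.items():
    print(f"{source:10s} SS = {value:7.2f}")
print(f"error variance = {error_variance:.2f}")
```

Collecting the treatment totals from the plan, rather than typing them in, mirrors the instruction in the text to select the yields according to the position of the treatments.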
7. Factorial Experiments. As the name denotes, in factorial experiments
an attempt is made to study the various treatment factors. Thus
an experiment designed to study, at the same time, rate and depth of
seeding of a cereal crop would be a factorial experiment in which the 2
factors, rate and depth of seeding, are represented at 2 or more levels.
We may use, for example, 3 rates and 3 depths, giving us in all 9 treatment
combinations. Usually, there are more than 2 factors, as it is
easily seen that the greater the number of factors the greater the scope
and inductive value of the experiment. The experiment on rates and
depths, for example, might well be conducted with more than 1 variety,
as it is conceivable that results obtained with 1 variety might not apply
to others. In factorial experimentation, therefore, the study of the
interactions is a very important consideration and, until the advent of
the development of a suitable technique, was very frequently completely
overlooked.

The introduction of factors is of course limited by space and the cost 
of experimentation, and, in addition, it is easy to add so many factors 
that the analysis becomes rather complex. If we have to study all the 
possible combinations in an experiment with 4 factors at 3 levels each, 
we must have 81 different combinations. The addition of another factor
at 3 levels would increase the number of combinations to 243, at which 
point the experiment would become extremely unwieldy, and since the 
blocks would be very large, error control would not be highly efficient. 
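
The growth in the number of treatment combinations is easy to tabulate. The following Python sketch, with the hypothetical helper `treatment_combinations`, simply enumerates every combination of levels; it is offered as an illustration of the counts quoted above, not as part of the original text.

```python
from itertools import product


def treatment_combinations(levels_per_factor):
    """List every combination of factor levels, one tuple per combination."""
    return list(product(*(range(levels) for levels in levels_per_factor)))


# 3 rates x 3 depths, then 4 and 5 factors at 3 levels each.
print(len(treatment_combinations([3, 3])))      # rate and depth of seeding
print(len(treatment_combinations([3] * 4)))     # 4 factors at 3 levels
print(len(treatment_combinations([3] * 5)))     # 5 factors at 3 levels
```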

If all the factors are of equal importance, the obvious method is to 
make up the total number of combinations and randomize them indis- 
criminately in each block. We shall see later that with this plan con- 
siderable increases in precision can be obtained by a process of splitting 
up the replicates into smaller units and confounding with these smaller
blocks certain relatively unimportant degrees of freedom. In many 
cases the factors are not of equal importance and very efficient use can 
be made of the split plot design, in which more than one error variance 
is obtained, each one appropriate for testing certain comparisons. 

8. Split Plot Experiments. An experiment was conducted in 1932 on 
the experimental field of the Dominion Rust Research Laboratory, which 
is a good example of the split plot type. This particular study was
designed to determine the effect on the incidence of root rot, of variety of
wheat, kinds of dust for seed treatment, method of application of the 
dust, and efficacy of soil inoculation with the root-rot organism. 

The plan of the experiment with the key to the treatments is given 
below and is sufficient to indicate how the experiment was worked out. 
Two varieties of wheat, Marquis and Mindum, were used. These vari- 
eties were planted in 4 blocks, half of each block being sown to one variety 
and half to the other. The strips were then divided into 10 plots each.
With 5 different kinds of dust and 2 methods of application, dry and wet, 
there were 10 different treatments, and one of these was assigned at 
random to each plot in each strip. The plots were then divided length- 
wise and on one half the seed was sown with inoculated soil and on the 
other half with uninoculated soil. The final result was as shown in the 
plan of the experiment. It will be noted that the disposition of varieties, 
dust treatments, and soil treatments is purely at random throughout 
the experiment. 

In order to analyze this experiment it is necessary to sort out the 
degrees of freedom corresponding to the various components of the test. 
In the first place, for the 160 plots there is a total of 159 degrees of free- 
dom. The 160 plots are in pairs, one of each pair being inoculated (I), 
and one uninoculated (U). A convenient initial classification of the 
degrees of freedom (DF) is to consider the field as made up of 80 pairs
of plots, and since there is one DF within each pair, we have

Between 80 pairs      79 DF
Within 80 pairs       80 DF          (9)
Total                159 DF


Then, proceeding to the splitting up of the DF of these two components,
and dealing first with the 79 DF for between pairs, we note that the units
now are plots exactly twice the size of the original plots, and the DF can
be analyzed out without any reference whatsoever to the fact that the
plots are divided into I and U portions. If the experiment is considered
first as a test of 10 treatments replicated 8 times, the analysis would be as
follows:

Blocks           7 DF
Treatments       9 DF          (10)
Error           63 DF


Plan of a Split Plot Experiment

Key to Treatments

I = Inoculated soil.
U = Uninoculated soil.

 1. Dry, Ceresan.          2. Wet, Ceresan.
 3. Dry, Semesan.          4. Wet, Semesan.
 5. Dry, DuBay.            6. Wet, DuBay.
 7. Dry, Check.            8. Wet, Check.
 9. Dry, CaCO3.           10. Wet, CaCO3.

But the experiment is not actually replicated 8 times, as 4 of these blocks
are sown to Marquis wheat and 4 to Mindum wheat. The 7 DF for
blocks contain, therefore, 1 DF for varieties and 3 DF for the interaction
of varieties with blocks, where the blocks consist now of two sets of all
the treatments, one set with Marquis wheat and one set with Mindum
wheat. The 3 DF for the interaction of varieties with blocks obviously
represent the error for determining the significance of the differences
between the varieties. The final disposition of the 7 DF as given in
(10) is therefore:


Blocks           3 DF
Varieties        1 DF          (11)
Error (1)        3 DF


We next consider the 9 DF as given in (10) for treatments. The key to
treatments shows that there are 4 different dusts and 1 check, so that
we have 4 DF for dusts. Then each dust is applied dry (D) and
applied wet (W), so that we must have 1 DF for D W. The remaining
4 DF represent the interaction of dusts with D W, so that the 9 DF are
finally split up as follows:

Dusts            4 DF
D W              1 DF          (12)
Interaction      4 DF

The effect of the varieties (V) on the factors given in (12) must also be
considered; therefore we must have in the 63 DF for error given in
(10):

V X Dusts             4 DF
V X D W               1 DF          (13)
V X Dusts X D W       4 DF

The 9 DF represented in (13) must obviously come out of the 63 DF for
error as given in (10), so that there are actually only 54 DF representing
true error. Finally the complete disposition of the 79 DF for between
pairs of plots can be shown as follows:

Blocks                 3 DF
Varieties              1 DF     Group (1)
Error (1)              3 DF

Dusts                  4 DF
D W                    1 DF
Dusts X D W            4 DF
V X Dusts              4 DF     Group (2)
V X D W                1 DF
V X Dusts X D W        4 DF
Error (2)             54 DF

Total                 79 DF

Error (2) is applicable to all the factors in the second group. 

TABLE 39

Plot Yields in a Split Plot Experiment
1 2 3 4 5 6 7 8 9 10


Olios OsItI 02 73 SO 07 78 09 07 60 71 64 04 75 70 66 67 05 


70 72 73 67 72 70 72 85 76 70 71 74 
Plot Yields in a Split Plot Experiment (Continued)
1 2 3 4 5 6 7 8 9 10


66 

63 

63 

51 

58 

64 

m 

67 

56 

60 

53 

61 

73 

56 

69 

55 

47 

58 

64 

55 

64 

74 

73 

72 

73 

64 

79 

68 

68 

72 

76 

69 

66 

78 

67 

63 

69 

74 

73 

76 


83 

73 

68 

60 

82 

79 

73 

81 

84 

94 

77 

76 

74 

77 

76 

73 

69 

70 

76 

88 

51 

59 

57 

67 

63 

i 

57 

61 

63 

65 

64 

61 

60 

65 

56 
56

59

Considering now the 80 DF for within pairs, the first point to note is
that, since these 80 DF represent only differences between members of
pairs of adjacent plots, they do not contain any direct effects due to
blocks, varieties, or dust treatments. The differences between such
plots do represent, however, the effect of I and U corresponding to 1 DF.
The first split up of the 80 DF is therefore:

I U               1 DF
Remainder        79 DF          (14)
Total            80 DF


The 79 DF for the remainder must contain the DF representing the
interaction of I U with all the other factors as given in Groups (1) and (2);
hence we can set these down in order.

I U X V                  1 DF
I U X Dusts              4 DF
I U X D W                1 DF
I U X Dusts X D W        4 DF
I U X V X Dusts          4 DF
I U X V X D W            1 DF

Total                   15 DF


Note that we have left out (I U X Blocks) and the quadruple interaction
(I U X V X Dusts X D W). The former belongs to error, and the latter
is very unlikely to be significant, and even if it might turn out significant,
its interpretation would probably be too complex to have any practical
bearing on the use of the treatments. The final analysis of the 80 DF
for within pairs can now be written down:

I U                      1 DF
I U X V                  1 DF
I U X Dusts              4 DF
I U X D W                1 DF
I U X Dusts X D W        4 DF     Group (3)
I U X V X Dusts          4 DF
I U X V X D W            1 DF
Error (3)               64 DF

Total                   80 DF


The three groups may be placed together as one complete analysis or 
dealt with separately. It will usually be found most convenient in 
checking calculations to consider the three groups together in one com- 
plete analysis. 
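
The sorting out of the degrees of freedom can be verified mechanically. The Python sketch below rebuilds the three groups from the numbers of blocks, varieties, dusts, methods of application, and soil treatments described above, obtaining each error term by subtraction. The variable names are illustrative assumptions, not notation from the text.

```python
# Layout of the split plot experiment as described in the text.
blocks, varieties, dusts, methods, soils = 4, 2, 5, 2, 2
plots = blocks * varieties * dusts * methods * soils    # 160 plots
pairs = plots // soils                                  # 80 pairs of I and U plots

# Group (1): whole strips, with varieties tested against Error (1).
group1 = {"Blocks": blocks - 1, "Varieties": varieties - 1}
group1["Error (1)"] = (blocks - 1) * (varieties - 1)

# Group (2): dust treatments and their interactions with varieties.
d, w, v = dusts - 1, methods - 1, varieties - 1
group2 = {
    "Dusts": d, "D W": w, "Dusts X D W": d * w,
    "V X Dusts": v * d, "V X D W": v * w, "V X Dusts X D W": v * d * w,
}
group2["Error (2)"] = (pairs - 1) - sum(group1.values()) - sum(group2.values())

# Group (3): within pairs; the quadruple interaction is left in the error.
i = soils - 1
group3 = {
    "I U": i, "I U X V": i * v, "I U X Dusts": i * d, "I U X D W": i * w,
    "I U X Dusts X D W": i * d * w, "I U X V X Dusts": i * v * d,
    "I U X V X D W": i * v * w,
}
group3["Error (3)"] = pairs * (soils - 1) - sum(group3.values())

total_df = sum(map(sum, (group1.values(), group2.values(), group3.values())))
for name, group in (("Group (1)", group1), ("Group (2)", group2), ("Group (3)", group3)):
    print(name, group)
print("Total DF:", total_df)
```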

After completing the sorting out of the DF the next step is to draw 
up the tables from the actual data that are necessary for calculation of 
the sums of squares. In the first place a table such as Table 39 is 
required, giving the data for the individual plots in a plan corresponding 
to the plan of the experiment. Comparing the table and the plan we 
can then draw up Table 40, which is a series of small tables required for
calculating the sums of squares. 

The following is an outline of the analysis of variance for the whole 
experiment, with figures in the fifth column indicating the calculation
tables from which the corresponding sums of squares are obtained. 

From Table 39 we calculate the total sum of squares for all the plots. 
Then from the calculation Table 12, for the differences within pairs of 
plots, we determine the sum of squares for the 80 DF representing within 
pairs. Subtracting this from the total sum of squares gives the sum of 
squares for 79 DF representing Groups (1) and (2). 

We proceed next to calculate, from the tables, the sums of squares as 
indicated in the outline of the analysis of variance, leaving items error 
(2) and error (3) to the last. From the sum of squares representing 
within pairs for 80 DF, we subtract the first seven items in Group (3). 
The remainder is the sum of squares for error (3). From the sum of 
squares for between pairs (79 DF) we subtract the total for group (1) 
and the first six items in Group (2). The remainder is the sum of 
squares for error (2). 

The method of calculation of triple interactions has been described in 
a previous chapter. 
                                       Sums of           Variance or     Calculation
                                       Squares     DF    Mean Square        Table

Blocks                                   989.6      3       329.9             1
Varieties (V)                           3638.6      1      3638.6             1
Error (1)                                647.6      3       215.9             1

Dusts                                    987.6      4       246.9             2
Dry vs. Wet (D W)                        117.3      1       117.3             2
Dusts X D W                               46.2      4        11.6             2
V X Dusts                                146.7      4        36.7             3
V X D W                                   91.6      1        91.6             4
V X Dusts X D W                          148.1      4        37.0             5
Error (2)                               1059.1     54        19.6

Inoculated vs. Uninoculated (I U)        965.3      1       965.3             6
I U X V                                    0.3      1         0.3             6
I U X Dusts                              379.8      4        95.0             7
I U X D W                                 68.9      1        68.9             8
I U X Dusts X D W                         25.8      4         6.4             9
I U X V X Dusts                          119.4      4        29.8            10
I U X V X D W                              3.9      1         3.9            11
Error (3)                                931.1     64        14.5

Total                                 10,366.8    159


TABLE 40

Series of Subtables for Calculating Sums of Squares

Number: (1), (2)

Blocks



I 

II 

III 

IV 


Ma 

Mi 

1361 

1454 

1178 

1408 

1229 



Ma + Ml 
Ma - Mi 

2806 

-103 

2686 

-230 



10,730 


Ce 

Be 

Du 

Ch 

Ca 

D 

W 

1044 

1039 

1069 

1046 

1062 


1 

D + W 

D - W 

2083 

6 

2104 

14 





Ce 

Se 

Du 

Ch 

Ga 

Ma 

Mi 

970 

1113 

088 

1116 

067 




( 3 ) 


Mb + Mi 2083 2104 
Mb - Mi -143 -128 


10,730 
TABLE 40 (Continued)

Series of Subtables for Calculating Sums of Squares

Number: (4), (5) Ma, (5) Mi, (6), (7), (8), (9) D, (9) W



D 

W 


Ma 

2408 



Mi 

2040 



Ma + Mi 

6438 


10.730 


Ce 

Se 

Du 

Ch 

Ca 

D 

482 

487 

480 



W 

488 

501 




D + W 

070 

088 


D- W 

-0 

-14 


D 

562 

572 

582 



W 

551 

544 




D + W 

1113 

1116 


D- W 

11 

28 




Ma 

Ml 


I 

2594 



U 

2394 



I + U 

4988 


10.739 


Ce 

Se 

Du 

Ch 

Ca 

I 

1069 

1070 

1054 



II 

1014 

1034 




i-hU 

2083 

2104 


I -u 

55 

36 



D 

W 


1 

2791 



V 

2647 



I + U 

5438 


10.739 


Ce 

Se 

Du 

Ch 

Ca 

I 

531 

535 

533 



u 

513 

524 




I + U 

1044 

1059 


I -u 

18 

11 


I 

538 

535 

521 



u 

501 

510 




I+U 

1039 

1045 


I -u 

37 

25 



10.739 


5.438 


5,301 
TABLE 40 (Continued)

Series of Subtables for Calculating Sums of Squares
Number


Ce Se Du Ch Ca 



(10) J 1 + U 970 988 4,988 

I - U 26 26 



Ma Ml 

D W D W 



I + U 2498 4,988 2940 5,751 


Differences between Pairs of Plots 


4 

3 

11 

11 

9 

1 

7 

11 

4 

2 

7 

6 

6 

1 

2 

6 

2 

13 

6 

3 



Abbreviations

Dusts:                         Varieties:           Soil Treatment:
Ce (Ceresan)                   Ma (Marquis)         I (Inoculated)
Se (Semesan)                   Mi (Mindum)          U (Uninoculated)
Du (DuBay)
Ch (Check)                     Method of Applying Dust:
Ca (Calcium carbonate)         D (Dry)
                               W (Wet)
CONFOUNDING IN FACTORIAL EXPERIMENTS

9. Orthogonality and Confounding. F. Yates (16) has given the
following definition of orthogonality. "It is that property of the design
which ensures that the different classes of effects shall be capable of
direct and separate estimation without any entanglement." Thus, in
a randomized block experiment, the treatments are orthogonal with 
blocks in that the effects of each are capable of direct and separate esti- 
mation. This orthogonality is accomplished in the design by seeing to 
it that each block contains the same kind and number of treatments. 
If by any chance some of the plots in one or more of the blocks are lost, 
non-orthogonality is introduced, and special methods may be required 
in order to separate the treatment and block effects. These methods, 
which have been worked out and described in some detail by Yates, 
require additional computation, and sometimes the whole procedure may 
be rather laborious. Consequently in designing an experiment we make 
every effort to keep within the requirements of orthogonality. In simple 
experiments this presents no difficulty, but in more complex ones for 
which a new design is being worked out it is quite easy unwittingly to 
introduce an element of non-orthogonality. New designs, therefore,
require careful scrutiny before they are put into practice. 

In factorial experiments involving a fairly large number of combina- 
tions, non-orthogonality is sometimes introduced deliberately, and this 
process is now referred to as confounding. The purpose of confounding 
in general, as we shall see later in more detail, is to increase the accuracy 
of the more important comparisons at the expense of the comparisons 
of lesser importance. In many instances, however, although a certain 
portion of the information concerning the comparisons of lesser impor- 
tance is sacrificed, the precision with which all the effects are estimated 
is increased to a point such that even the partially confounded compari- 
sons are more accurately estimated. 

The student should at this point make quite certain of the meaning 
of confounding, and a few elementary illustrations may be of assistance. 
Suppose that three fertilizers N, P, and K are being compared at 2 levels
of each, so that we have 8 different combinations that we shall designate
by $N_0P_0K_0$, $N_0P_0K_1$, $N_0P_1K_0$, $N_1P_0K_0$, $N_0P_1K_1$, $N_1P_0K_1$, $N_1P_1K_0$,
and $N_1P_1K_1$, where the subscript numbers refer to the amounts or dosage
of each kind of fertilizer. Since $N_0P_0K_0$ means that no fertilizer is
applied, and $N_0P_0K_1$ means that only K is applied, these terms may be
abbreviated to 0, K, P, N, PK, NK, NP, and NPK. In these 8 combinations
it will be noted that we have 4 without N and 4 with N. If
now we divide the blocks ordinarily containing 8 plots into halves such
that one half contains the treatments 0, K, P, PK and the other half
N, NK, NP, NPK, then the effect of N, which may be represented
algebraically by $(N_1 - N_0)$, is completely confounded with block effects.
The other main effects are still orthogonal with the blocks. For example, 
in each block we have 2 plots containing P and 2 plots that do not 
contain P. We would not consider a design of this type in actual 
practice, as it defeats what is obviously one of the main purposes of 
the experiment. Assuming, however, that accuracy can be gained by 
reducing the size of the blocks, it may be worth while to examine all the 
comparisons to see whether certain of these may be deemed sufficiently 
unimportant to be sacrificed in order to increase the precision of the re- 
maining comparisons. 

The treatment effects may be set out as follows with the corresponding
degrees of freedom.

N                1 DF
P                1 DF     Main effects, 3 DF
K                1 DF

N X P            1 DF
N X K            1 DF     Simple interactions, 3 DF
P X K            1 DF

N X P X K        1 DF     Triple interaction, 1 DF

To the best of our judgment the triple interaction N X P X K would
seem to be the least important. At least, even if significant in effect it is
the most difficult to interpret in terms of actual fertilizer practice. We
shall decide, therefore, to confound this one degree of freedom with
blocks, and it remains only to determine the distribution of the treatments
in the blocks in a manner which will confound this one comparison
and leave all the others intact. Algebraically, all the treatment effects
can be represented as follows:

$$\begin{aligned}
N &= (N_1 - N_0)(K_1 + K_0)(P_1 + P_0)\\
P &= (N_1 + N_0)(K_1 + K_0)(P_1 - P_0)\\
K &= (N_1 + N_0)(K_1 - K_0)(P_1 + P_0)\\
N \times P &= (N_1 - N_0)(K_1 + K_0)(P_1 - P_0)\\
N \times K &= (N_1 - N_0)(K_1 - K_0)(P_1 + P_0)\\
P \times K &= (N_1 + N_0)(K_1 - K_0)(P_1 - P_0)\\
N \times P \times K &= (N_1 - N_0)(K_1 - K_0)(P_1 - P_0)
\end{aligned}$$

and on expanding the last expression we have

$$\begin{aligned}
N \times P \times K &= (N_0P_0K_1 + N_1P_1K_1 + N_0P_1K_0 + N_1P_0K_0) - (N_0P_0K_0 + N_1P_1K_0 + N_0P_1K_1 + N_1P_0K_1)\\
&= (K + NPK + P + N) - (0 + NP + PK + NK)
\end{aligned}$$

This means simply that, if we let the symbols represent the actual yields
from the corresponding plots, the sum of squares for the triple interaction
will be given by

$$\frac{[(N + P + K + NPK) - (0 + NP + PK + NK)]^2}{2k}$$

where k is the number of plots represented in each total such as (0 + NP
+ PK + NK). Now if we divide each replication into 2 blocks and in
one of these put the treatments 0, NP, PK, NK, and in the other, 
N, P, K, NPK, then the above sum of squares will contain not only the 
triple interaction effect but also the effect of the blocks. The 1 degree of 
freedom for triple interaction will have been completely confounded 
with blocks. The analysis of variance for the experiment, assuming 
4 replications, will be of the form 


Blocks                      7 DF
Main effects                3 DF
Simple interactions         3 DF
Error                      18 DF

Total                      31 DF


Since 7 DF have been utilized for error control instead of 3 as in an 
ordinary randomized block experiment, with a moderate degree of soil 
heterogeneity, it may be expected that the remaining effects will be 
estimated more accurately by the confounded experiment than by the 
randomized blocks.* 
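
The allocation of treatments to the two half-blocks follows directly from the signs of the confounded contrast. The Python sketch below is an illustration under the 2 X 2 X 2 scheme of this section; the function names `contrast_sign` and `confounded_blocks` are hypothetical. A treatment goes into one half-block when its sign in the expanded contrast is positive and into the other when it is negative.

```python
from itertools import product

FACTORS = "NPK"


def label(levels):
    """Name a combination by the fertilizers applied, '0' when none is applied."""
    name = "".join(f for f, level in zip(FACTORS, levels) if level == 1)
    return name or "0"


def contrast_sign(levels, effect):
    """Sign of one combination in the expansion of (N1 -/+ N0)(P1 -/+ P0)(K1 -/+ K0).

    Factors named in `effect` contribute (X1 - X0); the others contribute (X1 + X0).
    """
    sign = 1
    for factor, level in zip(FACTORS, levels):
        if factor in effect:
            sign *= 1 if level == 1 else -1
    return sign


def confounded_blocks(effect):
    """Split the 8 combinations into the two half-blocks that confound `effect`."""
    plus, minus = [], []
    for levels in product((0, 1), repeat=len(FACTORS)):
        target = plus if contrast_sign(levels, effect) > 0 else minus
        target.append(label(levels))
    return plus, minus


print(confounded_blocks("N"))      # the unacceptable design: N confounded
print(confounded_blocks("NPK"))    # triple interaction confounded
```

Calling the same function with "NP", "NK", or "PK" gives the half-blocks needed when a simple interaction is to be confounded instead.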

10. Partial Confounding and Recovery of Information. The procedure
illustrated above resulted in the complete sacrifice of the information
on the triple interaction, and it may be argued that, regardless
of the apparent unimportance of the information sacrificed, this is not
good experimental procedure in that the experimenter is taking too much
for granted in attempting to forecast a result on which he has no previous
information, and using this as a basis for the experimental design. The
difficulty can be overcome by a process known as partial confounding,
which amounts to confounding different degrees of freedom in different
replications and using the results from the blocks in which the particular
effects are not confounded to recover a portion of the information desired.
In order to partially confound the experiment described above
and at the same time recover a portion of the information on all the
comparisons, we shall require at least 4 replications. In each replication
we can confound with blocks a degree of freedom from one of the
interactions. The method of laying out the treatments in the blocks is
obvious if we expand algebraically each of the expressions for the
interactions. Thus


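$$\begin{aligned}
\text{I}\quad (N \times P \times K) &= (N_1 - N_0)(P_1 - P_0)(K_1 - K_0) = (K + NPK + P + N) - (0 + NP + PK + NK)\\
\text{II}\quad (N \times P) &= (N_1 - N_0)(P_1 - P_0)(K_1 + K_0) = (0 + NP + K + NPK) - (N + P + NK + PK)\\
\text{III}\quad (N \times K) &= (N_1 - N_0)(P_1 + P_0)(K_1 - K_0) = (0 + NK + P + NPK) - (N + K + NP + PK)\\
\text{IV}\quad (P \times K) &= (N_1 + N_0)(P_1 - P_0)(K_1 - K_0) = (0 + PK + N + NPK) - (P + K + NP + NK)
\end{aligned}$$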

Then in the first replication we can confound the triple interaction and 
conserve it in all the remaining replications. In the second replication 
we can confound the simple interaction N X P and conserve it in all 
the remaining replications. With 4 replications we can confound each 
interaction in 1 replication and conserve it in all the others. 

In recovering information with respect to the interactions it will, of 
course, be necessary to make the desired comparisons only in those 
replications in which the particular interaction is not confounded. Thus
if we are computing the sum of squares for N X P we omit replication 
II entirely and make up our totals from the other three. The final 
analysis will be of the form : 


Blocks                      7 DF
Main effects                3 DF
Simple interactions         3 DF
Triple interaction          1 DF
Error                      17 DF

Total                      31 DF


The result of this procedure is to sacrifice one-quarter of the information
on each interaction, but the main effects and that portion of the information
with respect to the interactions that is recovered may be estimated
with greater accuracy.

Using a set of figures from uniformity data the procedure for laying out
and analyzing a partially confounded 2 X 2 X 2 experiment is illustrated
in Example 35.
Example 35. Partial Confounding in a 2 X 2 X 2 Experiment.

TABLE 41

Plan of Field Showing Location of Treatments and Corresponding Yields,
for a Partially Confounded 2 X 2 X 2 Experiment

Replication No.    Treatment  Yield    Treatment  Yield    Treatment  Yield    Treatment  Yield



NK 

160 

P 

163 

0 

146 

K 

180 



0 

170 

NPK 


PK 

101 

P 

272 


I 

PK 

136 

N 

163 

NK 

mrm 

N 

160 



NP 

130 

K 

182 

NP 

240 

NPK 

305 








876 


026 

3,006 


NPK 

166 

N 

101 

0 

226 

P 

266 



NP 

120 

NK 

138 

K 

160 

NK 

miom 


II 

K 

161 


188 

NPK 

240 

PK 

233 



0 

160 


210 

NP 

182 

N 

278 




504 

■ 

727 


807 


1077 

3.206 


P 

164 


143 

P 

186 

N 

200 



NK 

77 

NP 

110 

NPK 

173 

K 

03 


III 

0 

02 

N 

116 

0 

170 

PK 

224 



NPK 

128 

PK 

179 

NK 

213 

NP 

245 




451 


556 


742 


771 

2,620 


N 


m 

136 

PK 


K 

203 



PK 

mSm 


107 

0 

mvlm 

NK 

226 


IV 

NPK 

186 

■Q 

182 

NPK 

166 

NP 

248 



0 

148 

NP 

212 

N 

183 

■1 

269 




573 


727 


606 

■ 

1036 

3,032 








■1 


11,862 


Table 41 gives the location of the treatments in the field and the corresponding
yields. The latter were taken from uniformity data as the results from an actual
experiment were not available. Note that the replicate numbers (actually two
replicates) correspond with the numbers given opposite the expansion of the
interactions on page 163. Thus in replicate I the triple interaction N X P X K is
confounded with blocks, and so forth for the other interactions in the remaining
replications. Within each block the treatments are assigned to the plots at random.

In Table 42 the treatment totals are arranged in a convenient form for the
calculation of sums of squares. For example, in calculating the triple interaction
TABLE 42

Treatment Totals Required for Calculation
of Sums of Squares

                  All            Minus           Minus            Minus            Minus
Treatment     Replications   Replication I   Replication II   Replication III   Replication IV

0                1294             970             909             1032              971
N                1402            1089             933             1078             1106
P                1624            1199            1170             1284             1219
K                1392            1021            1082             1156              917
NP               1505            1135            1194             1141             1045
NK               1610            1151            1172             1320             1187
PK               1481            1155            1038             1078             1172
NPK              1544            1037            1149             1243             1203


N X P X K we must use the totals from the replicates in which this interaction is
not confounded. These are given in the third column, and we find

$$N \times P \times K = (1021 + 1037 + 1199 + 1089 - 970 - 1135 - 1155 - 1151)^2/48 = 88$$

Similarly the interaction P X K is calculated from the totals in the sixth column

$$P \times K = (1219 + 917 + 1045 + 1187 - 971 - 1172 - 1106 - 1203)^2/48 = 147$$

The main effects are of course calculated from all the replicates, so we make use of
the totals in the second column.

TABLE 43

Complete Analysis for Partially Confounded
2 X 2 X 2 Experiment

                 Sums of                 Mean
                 Squares       DF       Square

Blocks           112,462       15        7,497
N                  1,139        1        1,139
P                  3,249        1        3,249
K                    638        1          638
N X P                  9        1            9
N X K              3,781        1        3,781
P X K                147        1          147
N X P X K             88        1           88
Error             61,895       41        1,510

Total            183,408       63
11. Splitting up Degrees of Freedom into Orthogonal Components.

Before considering the problem of confounding in experiments of a more
complex type, the student should acquaint himself with the methods
of separating effects representing more than 1 degree of freedom into
component parts that are mutually independent and therefore may be
separately estimated from the data. Thus if we have 3 levels of nitrogen
in a fertilizer experiment, there are 2 degrees of freedom representing
the effect of nitrogen. These 2 degrees of freedom may be separated
with their appropriate sums of squares in an infinite number of ways.
But unless the separation is a purely formal one we will probably wish
to separate them in some way such that they will represent definite
effects relative to the interpretation of the experiment. In the case of the 3
levels of nitrogen $N_1$, $N_2$, and $N_3$, the 2 degrees of freedom can be
expressed by

(a) $N_3 - N_1$

(b) $2N_2 - N_1 - N_3$

and in this form (a) represents the linear effect of N on yield, and (b) the
quadratic effect. If the yields are represented graphically, (b) will be
zero if the 3 points lie exactly on a straight line. These two expressions
merely bring out the fact that any 2 points can be fitted by a straight line
function, and any 3 points by a quadratic function. Any other division
of the degrees of freedom that we might make would probably not have
as valuable a meaning as this one, although if one felt quite certain that
$N_3$ was a decided overdose of nitrogen one might wish to measure the
linear effect by $N_2 - N_1$, and the quadratic effect by $2N_3 - N_1 - N_2$.
In general, however, the expressions such as (a) and (b) are the most
useful.

If we have 4 levels of nitrogen the 3 degrees of freedom may be
divided:

(c) $3N_4 + N_3 - N_2 - 3N_1$      Linear term

(d) $N_4 - N_3 - N_2 + N_1$        Quadratic term

(e) $N_4 - 3N_3 + 3N_2 - N_1$      Cubic term

The rule for writing out the expressions for the division of degrees of
freedom is to see to it that in each expression the sum of the coefficients is
zero, and for any pair of expressions the sum of the products of the
corresponding coefficients is zero. Thus, in the set immediately above,
the sums of the coefficients are

(c) 3 + 1 - 1 - 3 = 0

(d) 1 - 1 - 1 + 1 = 0

(e) 1 - 3 + 3 - 1 = 0

Then multiplying the coefficients:

(c × d) 3 - 1 + 1 - 3 = 0

(c × e) 3 - 3 - 3 + 3 = 0

(d × e) 1 + 3 - 3 - 1 = 0

We must remember, however, that if we wish to write the polynomial
expressions as has been done here there is only one set that can be
written.

The sum of squares for any one of the above expressions may be
calculated by means of a simple rule. For example, if we have the
expressions (a) and (b) the sums of squares are

(a) $\frac{1}{2k}(N_3 - N_1)^2$        (b) $\frac{1}{6k}(2N_2 - N_1 - N_3)^2$

where the numerical portion of the divisor comes from summing the
squares of the coefficients within the bracket. The value of k comes
from the number of units entering into each subtotal. For example, in
(b), $N_1$, $N_2$, and $N_3$ may represent subtotals from 8 plots, whence the
complete divisor is 48.
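
The two rules above (coefficients summing to zero, cross-products summing to zero, and a divisor of k times the sum of squared coefficients) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the nitrogen totals are borrowed from Example 36 later in this chapter, where each total comes from 72 plots.

```python
from itertools import combinations


def is_orthogonal_set(contrasts):
    """Check the rule: each contrast sums to zero and each pair has zero cross-product."""
    sums_ok = all(sum(c) == 0 for c in contrasts)
    pairs_ok = all(
        sum(a * b for a, b in zip(c1, c2)) == 0
        for c1, c2 in combinations(contrasts, 2)
    )
    return sums_ok and pairs_ok


def contrast_ss(coeffs, totals, k):
    """Sum of squares for one contrast among subtotals that each come from k plots."""
    value = sum(c * t for c, t in zip(coeffs, totals))
    return value ** 2 / (k * sum(c * c for c in coeffs))


# Coefficients for N1..N4 in expressions (c), (d), (e).
FOUR_LEVELS = [(-3, -1, 1, 3), (1, -1, -1, 1), (-1, 3, -3, 1)]
print(is_orthogonal_set(FOUR_LEVELS))

# Nitrogen totals taken from Example 36 (each from k = 72 plots).
n_totals = (15393, 16030, 16900)
linear = contrast_ss((-1, 0, 1), n_totals, k=72)
quadratic = contrast_ss((-1, 2, -1), n_totals, k=72)
print(round(linear, 2), round(quadratic, 2))
```

Because the two contrasts are orthogonal, their sums of squares add up to the ordinary 2-degree-of-freedom sum of squares for nitrogen computed from the three totals.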

An actual example of the division of 3 degrees of freedom according
to the scheme outlined above is given by Yates (17). The figures are
for response to nitrogen, and the results of the analysis are reproduced
below:

                   DF         SS

Linear term         1   19,636.4

Quadratic term      1      480.5

Cubic term          1        3.6

Total               3   20,020.6

When compared with the error of the experiment, the quadratic term
turned out to be insignificant, and the cubic term was below
expectation. Undoubtedly, this type of result is quite usual in agricultural
experiments, and since we can separate out not only main effects in the
above manner but also interaction effects, it follows that if a portion of
the degrees of freedom for an interaction effect is to be sacrificed by
confounding it is desirable in general to sacrifice that portion that is least
likely to be significant. At any rate, it may be wise to ensure that at
least the interaction between linear effects may be partially recovered
from the confounded experiment.

If the interaction between nitrogen at 2 levels and potash at 2 levels
can be represented by $(N_2 - N_1)(K_2 - K_1)$ it follows that, if there are
3 levels of nitrogen, the interaction N × K can be broken up into two
parts:

$(K_2 - K_1)(N_3 - N_1)$ and $(K_2 - K_1)(2N_2 - N_1 - N_3)$

where the second expression represents the interaction of the quadratic
effect of nitrogen with potash. This point may be more obvious if we
consider $(2N_2 - N_1 - N_3)$ as representing deviations from linear
regression instead of the quadratic response, and hence the interaction
may be written as K regression × N deviation or $K_r \times N_d$. Now if we
have 3 levels of potash as well as 3 levels of nitrogen the 4 degrees of
freedom for the interaction N × K may be broken up as follows:


$N_r \times K_r$    $(N_3 - N_1)(K_3 - K_1)$                  1 DF

$N_r \times K_d$    $(N_3 - N_1)(2K_2 - K_1 - K_3)$           1 DF

$N_d \times K_r$    $(2N_2 - N_1 - N_3)(K_3 - K_1)$           1 DF

$N_d \times K_d$    $(2N_2 - N_1 - N_3)(2K_2 - K_1 - K_3)$    1 DF

N × K                                                4 DF


and it may be of interest to do this from the standpoint of obtaining 
complete information with respect to the interaction. Yates (17) has 
given a useful table for calculating the sums of squares, which is repro- 
duced below in Table 44. 


TABLE 44

Guide for Calculating Sums of Squares for the
Interactions in a 3 × 3 Table

           N_r × K_r                      N_r × K_d

          K1    K2    K3                 K1    K2    K3
   N1     +1     0    -1          N1     -1    +2    -1
   N2      0     0     0          N2      0     0     0
   N3     -1     0    +1          N3     +1    -2    +1

   Divisor  4k                    Divisor  12k

           N_d × K_r                      N_d × K_d

          K1    K2    K3                 K1    K2    K3
   N1     -1     0    +1          N1     +1    -2    +1
   N2     +2     0    -2          N2     -2    +4    -2
   N3     -1     0    +1          N3     +1    -2    +1

   Divisor  12k                   Divisor  36k

k = number of units in each cell.
To use the table it is necessary to set up a table of subtotals in the same
form as the above squares. The subtotals are added or subtracted
according to the signs in the appropriate table. Thus if the subtotals
are represented by

   $x_1$   $x_2$   $x_3$
   $y_1$   $y_2$   $y_3$
   $z_1$   $z_2$   $z_3$

we get the sum of squares for $N_d \times K_r$ by

$\frac{1}{12k}(2y_1 + x_3 + z_3 - x_1 - z_1 - 2y_3)^2$
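
The use of the guide table can be sketched as follows. Each of the four squares is the outer product of a nitrogen contrast and a potash contrast, so a single helper covers all four cases. The helper name and the sign convention for the contrasts are assumptions made for illustration (signs do not matter, since the value is squared); the subtotals are the N × K totals of Example 36, each from k = 24 plots.

```python
LINEAR = (-1, 0, 1)      # regression (r) coefficients for levels 1, 2, 3
QUADRATIC = (-1, 2, -1)  # deviation (d) coefficients for levels 1, 2, 3


def interaction_ss(table, n_coeffs, k_coeffs, k):
    """Sum of squares for one single-degree-of-freedom interaction component.

    table[i][j] is the subtotal for nitrogen level i+1 and potash level j+1,
    each subtotal being built from k plots.
    """
    value = sum(
        n_coeffs[i] * k_coeffs[j] * table[i][j]
        for i in range(3)
        for j in range(3)
    )
    # k times the sum of squared cell coefficients: 4k, 12k, 12k or 36k.
    divisor = k * sum(a * a for a in n_coeffs) * sum(b * b for b in k_coeffs)
    return value ** 2 / divisor


# N x K subtotals of Example 36: rows N1..N3, columns K1..K3, k = 24.
nk = [
    [5403, 5096, 4894],
    [5520, 5436, 5074],
    [5376, 5818, 5706],
]

components = {
    "Nr x Kr": interaction_ss(nk, LINEAR, LINEAR, 24),
    "Nr x Kd": interaction_ss(nk, LINEAR, QUADRATIC, 24),
    "Nd x Kr": interaction_ss(nk, QUADRATIC, LINEAR, 24),
    "Nd x Kd": interaction_ss(nk, QUADRATIC, QUADRATIC, 24),
}
for name, ss in components.items():
    print(f"{name}: {ss:.2f}")
print(f"N x K total: {sum(components.values()):.2f}")
```

Rounded to whole numbers, the four components agree with the N × K entries of the analysis of variance in Example 36.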


In certain cases it may not be necessary to divide up the degrees of
freedom into orthogonal components that have any definite meaning, in
which case we refer to the division as a purely formal one. A 3 × 3
table, for example, may be represented as follows:

          P1    P2    P3
   N1     11    12    13
   N2     21    22    23
   N3     31    32    33

and from knowledge that has been derived from a study of the properties
of the Latin square, Fisher (2), it can be shown that the 4 degrees of
freedom representing the interaction N × P can be split up into two
orthogonal components by making up totals from the diagonals of the
above square. Thus 2 degrees of freedom of the interaction are
represented by the differences between the totals (11 + 22 + 33),
(21 + 32 + 13), (31 + 12 + 23), and the other 2 by the differences between
the totals (11 + 32 + 23), (21 + 12 + 33), (31 + 22 + 13). As a matter of
fact this provides a very useful method of calculating the interaction in
a 3 × 3 table as it is a direct method and the total sum of squares
calculated independently from the same table may be used to obtain a
perfect check on all the calculations.¹ The division of the 4 degrees of
freedom is, however, purely formal. In other words, we would expect
that on the average the two components would give us equal estimates
of the interaction variance.

¹ Note that the second set of totals can be obtained most easily by setting up the
numbers in the first three totals in the form of another square, and taking from this
square the same diagonals as were used in the first instance.
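
The check mentioned above can be sketched in Python: the interaction sum of squares built from the two sets of diagonal totals should equal the one found by subtracting rows and columns from the sum of squares for cells. The function names are hypothetical; the table is the N × K table of totals from Example 36 (k = 24 plots per cell), used here only as convenient data.

```python
def diagonal_components(table, k):
    """Two 2-DF components of the interaction in a 3 x 3 table of cell totals."""
    grand = sum(map(sum, table))
    correction = grand ** 2 / (9 * k)
    # First set of diagonals: (11, 22, 33), (31, 12, 23), (21, 32, 13).
    first = [sum(table[i][(i + s) % 3] for i in range(3)) for s in range(3)]
    # Second set of diagonals: (11, 32, 23), (21, 12, 33), (31, 22, 13).
    second = [sum(table[i][(s - i) % 3] for i in range(3)) for s in range(3)]
    ss_first = sum(t * t for t in first) / (3 * k) - correction
    ss_second = sum(t * t for t in second) / (3 * k) - correction
    return ss_first, ss_second


def interaction_by_subtraction(table, k):
    """Cells minus rows minus columns, the usual indirect calculation."""
    grand = sum(map(sum, table))
    correction = grand ** 2 / (9 * k)
    cells = sum(x * x for row in table for x in row) / k - correction
    rows = sum(sum(row) ** 2 for row in table) / (3 * k) - correction
    cols = sum(sum(col) ** 2 for col in zip(*table)) / (3 * k) - correction
    return cells - rows - cols


nk = [
    [5403, 5096, 4894],
    [5520, 5436, 5074],
    [5376, 5818, 5706],
]
ss_i, ss_j = diagonal_components(nk, 24)
print(round(ss_i, 2), round(ss_j, 2), round(ss_i + ss_j, 2))
print(round(interaction_by_subtraction(nk, 24), 2))
```

The two components are unequal in any one experiment; the statement in the text is only that they estimate the same interaction variance on the average.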
12. Confounding in a 3 × 3 × 3 Experiment. We shall now
consider the possibilities of confounding in a 3 × 3 × 3 experiment. The
3 main factors can be represented by N, K, and P, and since each
of these occurs at 3 levels there are 27 different combinations. The 26
degrees of freedom for treatments can be subdivided at first as follows:

N            2 DF
K            2 DF     Main effects            6 DF
P            2 DF

N × K        4 DF
N × P        4 DF     Simple interactions    12 DF
K × P        4 DF

N × K × P    8 DF     Triple interaction      8 DF

Now if we wish to conserve the main effects and the simple interactions
we must have at least 9 plots in each block. That is, the 3 levels of
each fertilizer must each be represented by 3 plots, and the 9
combinations of each pair of fertilizers must each be represented by 1 plot. The
required combinations to fulfill these conditions are given by a 3 × 3
Latin square in which the rows may be taken to represent the 3 levels
of nitrogen, the columns the 3 levels of potash, and the Latin letters
(here replaced by numbers) the 3 levels of phosphate. R. A. Fisher,
in introducing this solution, points out that there are only 12 solutions
of this 3 × 3 square and that these 12 fall into 4 sets such that in any one
set the other 2 may be generated by cyclic substitution of the numbers in
the square. The entire 12 solutions are reproduced below.



           (a)          (b)          (c)

          1 2 3        2 3 1        3 1 2
   I      2 3 1        3 1 2        1 2 3
          3 1 2        1 2 3        2 3 1

          1 3 2        2 1 3        3 2 1
   II     3 2 1        1 3 2        2 1 3
          2 1 3        3 2 1        1 3 2

          1 2 3        2 3 1        3 1 2
   III    3 1 2        1 2 3        2 3 1
          2 3 1        3 1 2        1 2 3

          1 3 2        2 1 3        3 2 1
   IV     2 1 3        3 2 1        1 3 2
          3 2 1        1 3 2        2 1 3


To make the meaning of these squares perfectly clear, suppose that we
consider the treatments represented by the square I (a). These are
$N_1K_1P_1$, $N_1K_2P_2$, $N_1K_3P_3$, $N_2K_1P_2$, $N_2K_2P_3$, etc. In any one
replication we must have all the treatments of one complete set such as I, II,
III, or IV, and within the replication the division of the treatments into
blocks is according to the division of the sets into (a), (b), and (c). In a
single replication we have 2 degrees of freedom for blocks, and these
must represent 2 degrees of freedom of the triple interaction that have
been confounded, as we have seen to it that the main effects and the
simple interactions have all been conserved. It follows also that it is
impossible, if the main effects and the simple interactions are conserved,
to confound more than 2 out of the 8 degrees of freedom of the triple
interaction. Such being the case, we shall still have, after confounding,
6 degrees of freedom for the triple interaction, which we may use to test
the significance of the residual portion of this effect.
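
The claim that blocks formed from one cyclic set conserve the main effects and the simple interactions can be checked mechanically. The sketch below assumes that square I (a) has rows 1 2 3, 2 3 1, 3 1 2 (rows for nitrogen, columns for potash, entries for phosphate) and that squares (b) and (c) follow by the cyclic substitution 1 to 2, 2 to 3, 3 to 1. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def cyclic_set(square_a):
    """Squares (a), (b), (c): each entry advanced cyclically by 0, 1, 2."""
    return [
        [[(p - 1 + shift) % 3 + 1 for p in row] for row in square_a]
        for shift in range(3)
    ]


def block_treatments(square):
    """Treatments (N, K, P) in one block: row = N level, column = K level, entry = P level."""
    return [(n + 1, k + 1, square[n][k]) for n in range(3) for k in range(3)]


def conserves_main_effects_and_pairs(block):
    """Each level of each factor on 3 plots; each pair of levels of two factors on 1 plot."""
    for f in range(3):
        if Counter(t[f] for t in block) != Counter({1: 3, 2: 3, 3: 3}):
            return False
    for f, g in ((0, 1), (0, 2), (1, 2)):
        if len({(t[f], t[g]) for t in block}) != 9:
            return False
    return True


SET_I_A = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
blocks = [block_treatments(sq) for sq in cyclic_set(SET_I_A)]

print(all(conserves_main_effects_and_pairs(b) for b in blocks))
all_treatments = sorted(t for b in blocks for t in b)
print(all_treatments == sorted(product((1, 2, 3), repeat=3)))
```

The second check confirms that the three blocks of a set together contain each of the 27 treatment combinations exactly once, so that one set makes up one complete replication.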

The actual procedure of confounding in an experiment of this kind is 
to set up the treatments and divide them into blocks according to one of 
the cyclic sets. The same division of the treatments into blocks is 
retained throughout the remaining replications. In analyzing the 
results, if set I has been used for confounding, then sets II, III, and IV 
are used to build up the treatment totals from which the sum of squares 
for the triple interaction is calculated. The details of this are given in 
Example 36. 

13. Partial Confounding in a 3 × 3 × 3 Experiment. By the
methods described above we are able to divide the 8 degrees of freedom
for the triple interaction into 4 sets of 2 that are mutually independent
and therefore may be separately estimated from the data. But these
sets represent purely formal differences, and although we confound only
2 of them and conserve 6, we are not able to separate out particular
effects such as that represented by $N_r \times K_r \times P_r$ for particular study.
To do this we must adopt the method of partial confounding which
results from using each of the cyclic sets once, one for each replication.
We require therefore a minimum of 4 replications. Space is inadequate
here to go into detail regarding the method of separating out the
particular components, but the student interested in these further aspects of
confounding will be able to obtain further information from R. A.
Fisher's “The Design of Experiments,” and from the monograph by
F. Yates, “Factorial Experimentation.”

Example 36. A Confounded 3 × 3 × 3 Experiment. In the preparation of this
example, data from a uniformity trial have been used. It serves therefore merely
to show the technique of setting up and analyzing a 3 × 3 × 3 experiment in which
2 degrees of freedom from the triple interaction have been confounded with blocks.

As indicated in Table 45 giving the treatment numbers and the corresponding
yields, the distribution of the treatments into the 3 blocks of each replication is
according to cyclic set I as described above. In order to abbreviate, only the
subscript numbers of the treatments are given, it being assumed that the three
ingredients such as NKP are in the same order in each case. Within the blocks the
treatments are, of course, randomized.

Table 46 is obtained by collecting the plot yields from Table 45. It is used for
calculating the main effects and the simple interactions. At the foot of this table are
given the treatment totals from which the sum of squares for the 6 degrees of freedom
for the triple interaction is calculated. These treatment totals may be obtained very
quickly by the combined use of the cyclic sets given in Section 12 and the 3 × 3
tables for N and K, one for each level of P. Knowing that set I has been used for
confounding, we obtain our treatment totals for calculating the triple interaction
from the application of sets II, III, and IV to the data given in Table 46. For
example, taking set II we note that the 1's in group (a) correspond in Table 46 (a)
with 1604, 1523, and 1912; the 2's correspond in Table 46 (b) with 1893, 2030, and
1845; and the 3's in Table 46 (c) with 1741, 1838, and 1917. Adding all these values,
we obtain 16,303. Then to obtain the next total the same process is repeated, using
the square indicated by II (b), and finally the third square, II (c), gives the third
total. The sets III and IV are then used in a similar manner to obtain the remaining
totals. The sum of squares is calculated for each set of 3 totals and these are added
to give the sum of squares for the 6 degrees of freedom of the triple interaction.
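
The procedure just described can be sketched in Python. The three tables of totals are those of Table 46 (a), (b), and (c), indexed here as nitrogen level by potash level, and the squares for sets II, III, and IV are the (a) squares of Section 12. The data layout and the function names are assumptions made for this illustration only.

```python
# Totals of 8 plots for each treatment: TOTALS[P level][N index][K index].
TOTALS = {
    1: [[1604, 1529, 1679], [2031, 1690, 1523], [1845, 1912, 1910]],
    2: [[1805, 1826, 1893], [1651, 2030, 1808], [1845, 1913, 1879]],
    3: [[1994, 1741, 1322], [1838, 1716, 1743], [1686, 1993, 1917]],
}

# Square (a) of each unconfounded set: rows = N levels, columns = K levels.
SETS_A = {
    "II": [[1, 3, 2], [3, 2, 1], [2, 1, 3]],
    "III": [[1, 2, 3], [3, 1, 2], [2, 3, 1]],
    "IV": [[1, 3, 2], [2, 1, 3], [3, 2, 1]],
}


def cyclic_set(square_a):
    """Squares (a), (b), (c) obtained by cyclic substitution of the entries."""
    return [
        [[(p - 1 + shift) % 3 + 1 for p in row] for row in square_a]
        for shift in range(3)
    ]


def square_total(square):
    """Add the 9 treatment totals picked out by one Latin square."""
    return sum(TOTALS[square[n][k]][n][k] for n in range(3) for k in range(3))


def set_ss(totals, plots_per_total=72):
    """2-DF sum of squares among the three totals of one cyclic set."""
    grand = sum(totals)
    return sum(t * t for t in totals) / plots_per_total - grand ** 2 / (3 * plots_per_total)


results = {}
for name, square_a in SETS_A.items():
    totals = [square_total(sq) for sq in cyclic_set(square_a)]
    results[name] = (totals, set_ss(totals))
    print(name, totals, round(results[name][1]))

triple_ss = sum(ss for _, ss in results.values())
print("Triple interaction, 6 DF:", round(triple_ss))
```

Each total is the sum of 9 treatment totals of 8 plots, which is why the divisor is 72.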

Mainly as an exercise, the sums of squares for the individual degrees of freedom
as represented by the regression and deviation from regression effects have all been
calculated and are shown in the analysis of variance, Table 47. These calculations
are very simple if one makes use of Yates's guide as given in Table 44. A few of
the calculations are reproduced below for further guidance:


$N_r$             $(16{,}900 - 15{,}393)^2/144 = 15{,}771.17$

$N_r \times K_r$    $(5403 + 5706 - 5376 - 4894)^2/96 = 7{,}332.51$

$N_d \times P_r$    $(2 \times 5244 + 5057 + 5596 - 4812 - 5667 - 2 \times 5297)^2/288 = 16.06$

$N_d \times K_d$    $(5403 + 5376 + 4 \times 5436 + 4894 + 5706 - 2 \times 5520 - 2 \times 5096 - 2 \times 5818 - 2 \times 5074)^2/864 = 13.25$


METHODS FOR TESTING A LARGE NUMBER OF VARIETIES 

14. General Principles. In factorial experiments, when the total
number of combinations is fairly large, we have seen that greater
accuracy can be obtained by confounding with blocks certain of the degrees
of freedom for the higher-order interactions. In variety experiments the
numbers are frequently quite large and we again meet with the problem
of insufficient accuracy owing to the large size of the blocks. In order
to overcome this difficulty Yates has developed methods that, by a
procedure analogous to confounding in factorial experiments, enable us to
divide up the replications into much smaller blocks, and these are used
as error control units. Since the small blocks contain only a fraction
of the total number of varieties, they are referred to as incomplete blocks.
Yates (20) in a preliminary examination of uniformity data concluded
that incomplete block experiments would give increases in efficiency over
randomized blocks of 20 to 50%. Goulden (6) arrived at practically
the same conclusion after a fairly extensive study.
TABLE 45

Treatment Numbers and Corresponding Plot Yields for 3 × 3 × 3 Experiment.
The Same Two Degrees of Freedom from the Triple Interaction
Confounded in All Replicates

Treatment No.   Yield   (this pair of columns is repeated six times, once for each block)


ise 

131 

153 

121 

146 

111 

11 

232 

153 


210 


179 

311 

202 

233 

101 

122 


112 

226 


285 


136 

232 

153 

323 

300 

212 

■03 

131 

122 

KfTK 

200 

133 

130 

333 

182 

331 

240 

313 

306 

311 

281 

312 

862 

313 

155 

213 

101 

211 

226 

283 

266 

213 

334 

211 

360 

111 

120 

112 

138 

312 

150 

321 

300 

333 

208 

331 

150 

223 

151 

221 

188 

222 

240 

231 

233 

221 

276 

113 

267 

332 

159 

123 

210 

113 

182 

332 

278 

322 

355 

222 

338 

122 

154 

322 

143 

132 

186 

133 

mm 

123 

247 

323 

323 

Block 












total! 

1351 


1560 


1860 


2212 


2202 


2434 

122 

77 

213 

110 

233 

173 


03 

131 

303 

132 

271 

313 

92 

112 

115 

132 

170 


224 

221 

221 

331 

251 

IB 

128 


170 

113 

213 


245 

213 

216 

211 

100 

■ ■ 

113 


136 

323 

182 

313 

203 

232 

310 

121 

120 

■ ■ 

■SI 

311 

197 

312 

175 


226 

311 

260 

323 

237 

111 


131 

182 

121 

156 


248 

123 

221 

113 

282 

212 

■SI 

322 

212 

211 

183 


260 

333 

350 

312 

303 

332 

215 

221 

120 

222 

138 

111 

228 

322 

310 

233 

300 

223 

132 

232 

162 

331 

192 

212 

326 

112 

314 

222 

247 

Block 












total! 

1217 


1422 


1582 


2152 


2541 


2319 

231 


131 

105 

121 

154 

223 

154 

112 


132 

260 

212 

143 

333 

227 

233 

145 

138 

107 

311 


323 

283 

313 

171 

123 

218 

312 

214 

?32 

107 

232 

318 

312 

170 

321 

190 

112 

180 

331 

219 

321 

227 

131 

250 

■ ■ 

101 

111 

150 

213 

165 

211 

186 

212 

222 

221 

227 


137 

122 

270 


173 

132 

187 

111 

230 

123 

107 

miM 

170 

223 • 

125 

311 

156 

323 

148 

231 

204 

322 

161 

211 

mEM 

332 

150 

322 

212 

222 

300 

122 

201 

213 


113 


133 

104 

221 

234 

113 

Eil 

313 

251 

333 

247 

mm 

la 

Block 












totala 

1423 


1670 


1808 


1883 


2158 


1852 

313 

124 

213 

la 

323 

276 

212 

328 

221 

260 

233 

388 

133 

136 

■ 1 


132 

225 

313 

295 

311 

166 

222 

228 

122 

297 


235 

331 

343 

111 

Km 

213 

mm 

323 

244 

212 

265 

■ ■ 

164 

233 

145 

231 

212 

333 

240 

121 


223 

200 

112 

264 

113 

258 

122 

320 

131 

300 

211 

421 

111 

180 

311 

277 

121 

194 

133 

325 

322 

252 

113 

344 

332 

260 

232 

283 

222 

280 

321 

464 

112 

335 

331 

336 

321 

215 

322 

250 

312 

304 

332 

376 

232 

247 

312 

240 

231 

262 

123 

243 

211 

286 

223 

410 

123 

260 

132 

360 

Block 



M 





■■1 




totala 

1047 

■ 

H 

■ 

2310 


2034 

■ 

2303 


2804 
TABLE 46

Treatment Totals Collected from Table 45 for Calculation
of Sums of Squares

(a) P1, k = 8

             N1        N2        N3     Total
   K1     1,604     2,031     1,845     5,480
   K2     1,529     1,690     1,912     5,131
   K3     1,679     1,523     1,910     5,112
   Total  4,812     5,244     5,667    15,723

(b) P2, k = 8

             N1        N2        N3     Total
   K1     1,805     1,651     1,845     5,301
   K2     1,826     2,030     1,913     5,769
   K3     1,893     1,808     1,879     5,580
   Total  5,524     5,489     5,637    16,650

(c) P3, k = 8

             N1        N2        N3     Total
   K1     1,994     1,838     1,686     5,518
   K2     1,741     1,716     1,993     5,450
   K3     1,322     1,743     1,917     4,982
   Total  5,057     5,297     5,596    15,950

(P1 + P2 + P3), k = 24

             N1        N2        N3     Total
   K1     5,403     5,520     5,376    16,299
   K2     5,096     5,436     5,818    16,350
   K3     4,894     5,074     5,706    15,674
   Total 15,393    16,030    16,900    48,323

(N1 + N2 + N3), k = 24

             P1        P2        P3     Total
   K1     5,480     5,301     5,518    16,299
   K2     5,131     5,769     5,450    16,350
   K3     5,112     5,580     4,982    15,674
   Total 15,723    16,650    15,950    48,323

(K1 + K2 + K3), k = 24

             N1        N2        N3     Total
   P1     4,812     5,244     5,667    15,723
   P2     5,524     5,489     5,637    16,650
   P3     5,057     5,297     5,596    15,950
   Total 15,393    16,030    16,900    48,323

Totals from cyclic sets II, III, and IV, k = 72

             II       III        IV
   (a)   16,303    15,836    15,831
   (b)   15,720    16,506    15,764
   (c)   16,300    15,981    16,728
   Total 48,323    48,323    48,323
TABLE 47

Analysis of Variance for 3 × 3 × 3 Experiment Showing Sums of Squares
for Individual Treatment Degrees of Freedom

                  DF         SS          SS     DF       MS      F    5% Point

   Blocks         23    548,407     548,407     23   23,844   8.83      1.59

   N_r             1     15,771
   N_d             1        126      15,897      2    7,948   2.94      3.05

   P_r             1        358
   P_d             1      6,128       6,485      2    3,243   1.20      3.05

   K_r             1      2,713
   K_d             1      1,223       3,936      2    1,968   0.73      3.05

   N_r × P_r       1      1,040
   N_r × P_d       1      4,737
   N_d × P_r       1         16       5,909      4    1,477   0.55      2.43
   N_d × P_d       1        116

   N_r × K_r       1      7,332
   N_r × K_d       1      1,508
   N_d × K_r       1      1,765      10,619      4    2,655   0.98      2.43
   N_d × K_d       1         13

   P_r × K_r       1        294
   P_r × K_d       1      1,850
   P_d × K_r       1      7,422      11,357      4    2,839   1.05      2.43
   P_d × K_d       1      1,791

                   2      3,131
   N × P × K       2      3,452      14,631      6    2,438   0.90      2.15
                   2      8,048

   Error         168    453,509     453,509    168    2,699

   Total         215  1,070,750

15. Incomplete Block Experiments. There are a number of different
types of incomplete block experiments, and only those are described here
that would seem to be of the greatest practical value in agronomic tests.
The type which can probably be regarded as the most elementary is
known as the two-dimensional quasi-factorial with two groups of sets. By
extending this type to three groups of sets we have a somewhat greater
degree of complexity, and this complexity continues to increase with the
number of groups of sets until we reach the point of using all possible
groups of sets, wherein the entire process of analysis suddenly becomes
very much simplified. The latter type may be referred to as a
symmetrical incomplete block experiment. Quasi-factorial experiments of the
three-dimensional type are also possible, and one of the simplest of these
will be described.

In discussing the general principles involved in incomplete block
experiments we shall consider a hypothetical experiment with only 9
varieties. With such a small number of varieties it would probably not
be worth while to use these methods, but a small example of this kind
will be quite sufficient to illustrate the general principles. First, we
take 9 numbers to represent the varieties and write them down in the
form of a square. These are two-figure numbers, the first figure
representing the row and the second the column of the square.

   11   12   13
   21   22   23
   31   32   33


If we suppose now that this square represents, instead of 9 different 
varieties, 9 combinations of 2 factors at 3 levels as in a simple 3X3 
factorial experiment, the degrees of freedom can be divided as follows: 

A (factor for which levels are indicated by first figure of two-figure numbers) 2 DF 
B (factor for which levels are indicated by second figure of two-figure numbers) 2 DF 
A × B (interaction) 4 DF

Furthermore, since the 4 DF for the interaction can be separated into
two orthogonal components, each represented by 2 DF, the total of
8 DF can be split up into 4 pairs. Then if the 9 combinations making
up a complete replication are divided into 3 blocks, any one of the
above pairs of degrees of freedom may be confounded with blocks. If
we should decide to confound the A factor with blocks, the degrees of
freedom for one replication will be apportioned as follows:

   Blocks    2 DF
   B         2 DF
   A × B     4 DF

and the method of confounding would be merely to put the treatments
together in the same block that occur in the rows of the square given
above. Similarly the B factor may be confounded by putting the
treatments in the same block that occur in the columns of the square. Then
from our knowledge of the properties of a Latin square it is clear that
if the interaction A × B is to be confounded it is only necessary to put
the treatments together in the same block that occur in the diagonals
of the square. In one replication we can confound only 2 out of the 4
degrees of freedom. For example, in one replication the arrangement
of the treatments in the blocks might be as follows:

   Block 1    11   22   33
   Block 2    21   32   13
   Block 3    31   12   23

and the degrees of freedom will be divided in the following manner:

   Blocks    2 DF
   A         2 DF
   B         2 DF
   A × B     2 DF

Alternatively, 2 degrees of freedom from the interaction may be
confounded with blocks by this arrangement:

   Block 1    11   32   23
   Block 2    21   12   33
   Block 3    31   22   13

Finally, it works out that in each replication a different pair of degrees
of freedom may be confounded with blocks, in which case the analysis
of variance will be of the following form:

   Blocks    11 DF
   A          2 DF
   B          2 DF
   A × B      4 DF
   Error     16 DF


By a process of partial confounding all the degrees of freedom for the
9 treatment combinations can be recovered, and at the same time error
control has been improved by the use of smaller blocks. The loss of
information due to partial confounding is seen to be exactly 1/4, since
each pair of degrees of freedom has been confounded in 1 replication and
conserved in 3. In other words, both the main factors and the
interaction are determined with 3/4 of the precision that would have resulted
if there had not been any confounding. The presumption, of course,
is that the error will be sufficiently reduced by confounding to more
than make up for the loss in precision.
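
The four ways of forming blocks from the 3 × 3 square (rows, columns, and the two sets of diagonals) can be generated and inspected with a short Python sketch. It also counts how often each pair of the 9 treatments falls in the same block over the 4 replications, which is one way of seeing why all comparisons are made with equal precision in this arrangement. The names used are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

P = 3

# Each rule assigns the treatment in row u, column v (0-based) to a block.
GROUPINGS = {
    "rows (A)": lambda u, v: u,
    "columns (B)": lambda u, v: v,
    "diagonals I": lambda u, v: (v - u) % P,
    "diagonals J": lambda u, v: (u + v) % P,
}


def blocks_for(key):
    """Blocks of one replication, treatments written as two-figure numbers."""
    blocks = [[] for _ in range(P)]
    for u in range(P):
        for v in range(P):
            blocks[key(u, v)].append(10 * (u + 1) + (v + 1))
    return blocks


replications = {name: blocks_for(key) for name, key in GROUPINGS.items()}
for name, blocks in replications.items():
    print(name, blocks)

# How often does each pair of treatments share a block?
pair_counts = Counter(
    pair
    for blocks in replications.values()
    for block in blocks
    for pair in combinations(sorted(block), 2)
)
print(len(pair_counts), set(pair_counts.values()))

# Each pair of degrees of freedom is confounded in 1 replication out of 4.
info_retained = Fraction(len(GROUPINGS) - 1, len(GROUPINGS))
print(info_retained)
```

Every one of the 36 pairs of treatments occurs together in exactly one block, and the fraction of information retained is 3/4 for each pair of degrees of freedom.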

Returning now to the testing of 9 different varieties, it should be
obvious that, if the varieties are designated by numbers and arranged
in a square as above, we can go through the same procedure of partial
confounding as has been outlined above for a 3 × 3 factorial experiment,
and theoretically the same increase in accuracy due to confounding will
be obtained. The method of analysis will also be clear from these
considerations, as we work it out in the first place as though it were a factorial
experiment and, after finding the sums of squares for the imaginary
factors and their interaction, we combine these to form the variety sum
of squares.

The fact that the variety numbers are first arranged in the form of a
square simulating a two-factor experiment is the basis of the term
“two-dimensional.” The number of groups of sets is based on the number of
groups of degrees of freedom that are confounded with blocks. In the
quasi-factorial 3 × 3 experiment, for example, the 8 DF for the 9
treatments can be divided orthogonally into 4 pairs, and if we confound only
2 of these pairs, the experiment is said to consist of “two groups of sets.”
With 9 varieties we have seen that 4 pairs of degrees of freedom can
be confounded, in which case we might refer to the experiment as one
with “four groups of sets,” but as pointed out above it is usual to refer
to experiments of this type as symmetrical incomplete block experiments.

In a quasi-factorial experiment with only two groups of sets it will
be obvious that all comparisons are not made with the same precision.
Suppose, for example, that the blocks are made up out of the rows and
columns of the square, in which case the analogous factorial experiment
would be outlined as follows:

   Blocks    5 DF (assuming 2 replicates only)
   A         2 DF
   B         2 DF
   A × B     4 DF
   Error     4 DF

in which the imaginary factors A and B are confounded in one replicate
and conserved in the other, while the interaction A × B is conserved
in both replicates. The main factors A and B are determined with
only 1/2 the precision with which the interaction is determined, and
transferring these ideas to a variety experiment it becomes clear that the
varieties that occur in the same row or in the same column are compared
more accurately than those that do not occur at all in the same block.

Another point that we should note here is that in estimating the result 
for any one treatment combination of the partially confounded factorial 
experiment, or of one variety in the quasi-factorial experiment, it will 
be necessary to make a correction for the blocks in which they occur. 
The actual totals are partially confounded with blocks. One variety 
may occur mainly in low-yielding blocks and another one in high- 
yielding blocks, and therefore the actual yield of the first variety must 
be increased and the yield of the second variety lowered, in order to 
make the two variety yields comparable. The details of this method 
of correction are given below. 
16. Two-Dimensional Quasi-Factorials with Two Groups of Sets.
Assuming that only 9 varieties are to be tested, the first step is to take
9 numbers to represent the varieties, as pointed out above, and arrange
them in the form of a square. The next step is to arrange the varieties
in sets according to the rows and columns of the square. These are
given below and the first group of sets is referred to as group X and the
second group of sets as group Y.

        Group X                Group Y

     11   12   13           11   21   31
     21   22   23           12   22   32
     31   32   33           13   23   33

The varieties in the sets are those that are assigned to the incomplete
blocks, and each group makes up a complete replication. The varieties
occurring in the same block are, of course, those that lie on the same
line of a group in the above figure. The groups can now be
repeated as many times as we wish in order to bring up the replicates
to the required number. The varieties are randomized within each
block, but the blocks themselves may be placed in any order.¹

Figure 11 illustrates diagrammatically the setup of the experiment
assuming 4 complete replications. The yields may be arranged in a
form somewhat similar to this for convenience in calculation. After
setting up the original yields they must be combined for each group and
then for both groups. The marginal totals are then obtained for each
group and for both groups combined, and we are ready to proceed with
the calculation of the sums of squares and the corrected variety means.

The calculation of the variety sum of squares follows from the analogy
to a factorial experiment.

                                                                      DF

In Group Y    $A = \sum(Y_{u.}^2)/np - Y_{..}^2/np^2$                  $p - 1$

In Group X    $B = \sum(X_{.v}^2)/np - X_{..}^2/np^2$                  $p - 1$

Group X + Group Y    $(A \times B) = \sum(T_{uv}^2)/2n - \sum(T_{u.}^2)/2np - \sum(T_{.v}^2)/2np + T_{..}^2/2np^2$    $(p - 1)^2$

¹ In certain cases the experimenter may decide, even after conducting the
experiment as a quasi-factorial, to use the actual yields or some other character of the
varieties, without correction. For example, he may wish to make quality or other
tests on composite samples made up from all the replicates. For this purpose it is
somewhat better to have the incomplete blocks randomized within each replication.
Group X (2 repetitions; blocks are the rows)         Totals for Group X

   11  12  13      11  12  13                 x11  x12  x13 | X1.
   21  22  23      21  22  23                 x21  x22  x23 | X2.
   31  32  33      31  32  33                 x31  x32  x33 | X3.
                                              X.1  X.2  X.3 | X..

Group Y (2 repetitions; blocks are the columns)      Totals for Group Y

   11  12  13      11  12  13                 y11  y12  y13 | Y1.
   21  22  23      21  22  23                 y21  y22  y23 | Y2.
   31  32  33      31  32  33                 y31  y32  y33 | Y3.
                                              Y.1  Y.2  Y.3 | Y..

Totals for both groups combined

   T11  T12  T13 | T1.
   T21  T22  T23 | T2.
   T31  T32  T33 | T3.
   T.1  T.2  T.3 | T..

Fig. 11. Representation of a miniature example of a two-dimensional quasi-factorial
experiment with two groups of sets.


where p is the number of varieties in one set and n is the number of 
repetitions of each group. 

Yates (20) gives a direct method of calculating the sum of squares for
varieties which is probably quicker than the one used above. Yates's
formula is

Varieties (SS) $= \sum(T_{uv}^2)/2n + \sum(X_{u.} - Y_{u.})^2/2np + \sum(X_{.v} - Y_{.v})^2/2np - (X_{..} - Y_{..})^2/2np^2 - [\sum(X_{u.}^2) + \sum(Y_{.v}^2)]/np$

We next calculate the total sum of squares for all the plots and for the
blocks, and obtain the error sum of squares by subtraction. The
summarized analysis is of the form

   Blocks       $2np - 1$
   Varieties    $p^2 - 1$
   Error        $(p - 1)(2np - p - 1)$

   Total        $2np^2 - 1$
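
As a check on the algebra, the variety sum of squares can be computed both by the factorial analogy (A from group Y, B from group X, and A × B from both groups) and by Yates's direct formula, and the two results compared. The sketch below does this in plain Python. The two tables of totals are made-up numbers for a 3 × 3 arrangement with n = 2, and all function names are hypothetical.

```python
def margins(table):
    """Row totals, column totals and grand total of a square table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)


def sq(values):
    """Sum of squares of a sequence of numbers."""
    return sum(v * v for v in values)


def variety_ss_factorial(X, Y, n):
    """A (from group Y) + B (from group X) + A x B (from both groups)."""
    p = len(X)
    T = [[X[u][v] + Y[u][v] for v in range(p)] for u in range(p)]
    _, x_cols, x_all = margins(X)
    y_rows, _, y_all = margins(Y)
    t_rows, t_cols, t_all = margins(T)
    a = sq(y_rows) / (n * p) - y_all ** 2 / (n * p * p)
    b = sq(x_cols) / (n * p) - x_all ** 2 / (n * p * p)
    ab = (sq(t for row in T for t in row) / (2 * n)
          - sq(t_rows) / (2 * n * p) - sq(t_cols) / (2 * n * p)
          + t_all ** 2 / (2 * n * p * p))
    return a + b + ab


def variety_ss_direct(X, Y, n):
    """Yates's single-expression form of the same sum of squares."""
    p = len(X)
    T = [[X[u][v] + Y[u][v] for v in range(p)] for u in range(p)]
    x_rows, x_cols, x_all = margins(X)
    y_rows, y_cols, y_all = margins(Y)
    return (sq(t for row in T for t in row) / (2 * n)
            + sq(x - y for x, y in zip(x_rows, y_rows)) / (2 * n * p)
            + sq(x - y for x, y in zip(x_cols, y_cols)) / (2 * n * p)
            - (x_all - y_all) ** 2 / (2 * n * p * p)
            - (sq(x_rows) + sq(y_cols)) / (n * p))


def degrees_of_freedom(p, n):
    """Blocks, varieties, error and total degrees of freedom."""
    return (2 * n * p - 1, p * p - 1,
            (p - 1) * (2 * n * p - p - 1), 2 * n * p * p - 1)


# Hypothetical totals over n = 2 repetitions of each group.
X = [[52, 61, 47], [58, 66, 50], [49, 57, 60]]
Y = [[55, 59, 51], [62, 63, 48], [53, 54, 58]]
print(variety_ss_factorial(X, Y, 2), variety_ss_direct(X, Y, 2))
print(degrees_of_freedom(5, 2))
```

For p = 5 and n = 2 the degrees of freedom come out as 19 for blocks, 24 for varieties, and 56 for error, making 99 in all.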
Just as in the factorial experiments that have been confounded, all
comparisons must be made within blocks. This means that to compare
2 varieties directly we cannot use the actual variety totals but must
prepare for these varieties ratings based on their behavior as compared to
other varieties in the same blocks. The least squares method gives us
as the best rating for any variety uv the following expression, which we
shall refer to as a corrected variety mean.

\[ t_{uv} = \frac{T_{uv}}{2n} + \frac{1}{2np}(X_{.v} - Y_{.v}) + \frac{1}{2np}(Y_{u.} - X_{u.}) \]


If a large table of yields is to be corrected it may save time to set up the
corresponding portions of the correction in the margins of the table. If
we let $C_{.v} = \frac{1}{2np}(X_{.v} - Y_{.v})$ and $C_{u.} = \frac{1}{2np}(Y_{u.} - X_{u.})$, then $C_{.1}$ will be
the portion to be added to all the variety means in the first column, and
$C_{1.}$ will be the portion to be added to all the variety means in the first
row.
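
The marginal corrections lend themselves to a short computation. The sketch below assumes the group totals of Table 50 (Example 37, p = 5, n = 2) entered as nested lists, with rows indexed by u and columns by v; the function name and data layout are illustrative only.

```python
# Group totals X_uv and Y_uv from Table 50 (rows u = 1..5, columns v = 1..5).
X = [[410, 315, 255, 190, 295],
     [110,  95, 240, 155, 200],
     [175, 305, 365, 255, 260],
     [220, 185, 180, 255, 120],
     [160, 185, 305, 340, 155]]
Y = [[355, 360, 335, 215, 380],
     [590, 305, 200, 320, 415],
     [445, 335, 170, 215, 260],
     [510, 400, 265, 150, 270],
     [475, 240, 230, 265, 285]]


def corrected_means(X, Y, n):
    """Return (t, C_row, C_col): corrected means and marginal corrections."""
    p = len(X)
    col = lambda M, v: sum(M[u][v] for u in range(p))
    # Row correction C_u. = (Y_u. - X_u.) / 2np
    C_row = [(sum(Y[u]) - sum(X[u])) / (2 * n * p) for u in range(p)]
    # Column correction C_.v = (X_.v - Y_.v) / 2np
    C_col = [(col(X, v) - col(Y, v)) / (2 * n * p) for v in range(p)]
    # Actual mean T_uv / 2n plus both corrections.
    t = [[(X[u][v] + Y[u][v]) / (2 * n) + C_row[u] + C_col[v]
          for v in range(p)] for u in range(p)]
    return t, C_row, C_col


t, C_row, C_col = corrected_means(X, Y, n=2)
print(C_row[0], C_col[0], t[0][0])
```

With these data the first row correction is 9.00, the first column correction is -65.00, and the corrected mean of variety 11 is 135.25.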

In this as in all other quasi-factorial arrangements the error variance
must be multiplied by a factor depending on the type of experiment, to
give the variance for comparing 2 varieties by their corrected means.
If $s^2$ is the error variance, the variance of the difference between the
corrected means of 2 varieties that occur in the same set is

\[ V(t_{21} - t_{11}) = \frac{s^2}{n}\left(\frac{p+1}{p}\right) \]

For 2 varieties not having a set in common the variance of the difference is

\[ V(t_{22} - t_{11}) = \frac{s^2}{n}\left(\frac{p+2}{p}\right) \]

The mean variance of all comparisons is

\[ V_m = \frac{s^2}{n}\left(\frac{p+3}{p+1}\right) \]

and when p is not too small we may use the latter variance for all
comparisons without appreciable error.
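
As an illustration, the three variances can be computed side by side. This sketch assumes the usual factors for this design, (p + 1)/np for varieties sharing a set and (p + 2)/np otherwise, which agree with the numerical values of Example 37; the mean variance is obtained here as a weighted average over all pairs rather than from a closed formula. The function name is hypothetical.

```python
def difference_variances(s2: float, p: int, n: int) -> dict:
    """Variances of a difference between two corrected variety means."""
    same_set = s2 * (p + 1) / (n * p)        # varieties sharing a row or column
    no_common_set = s2 * (p + 2) / (n * p)   # varieties sharing neither
    # Each variety has 2(p - 1) partners in a common set and (p - 1)^2 others.
    n_same, n_other = 2 * (p - 1), (p - 1) ** 2
    mean = (n_same * same_set + n_other * no_common_set) / (n_same + n_other)
    return {"same_set": same_set, "no_common_set": no_common_set, "mean": mean}


v = difference_variances(s2=1998.9, p=5, n=2)
print({k: round(x, 2) for k, x in v.items()})
```

The weighted average reduces algebraically to (s²/n)(p + 3)/(p + 1), so for p = 5 the mean variance lies about two thirds of the way between the two exact values.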

Example 37. Two-Dimensional Quasi-Factorial with Two Groups of Sets.
Using uniformity data and assuming a test of 25 varieties in 4 replications, this
example has been worked through in detail in order to show the methods of
calculation. Setting up first the specifications of the test:


Varieties in each set (p)            = 5
Varieties (v) = p²                   = 25
Sets (s) = 2p                        = 10
Replications of each group (n)       = 2
Replications (r) = 2n                = 4
Blocks (b) = 2np                     = 20
Total number of plots (N) = 2np²     = 100


The variety numbers are first written down in the form of a square:


11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55


and the 10 sets in 2 groups of 5 are taken from the rows and columns of the square. The
varieties in these sets are then randomized in the blocks as indicated in Table 49.
Here the groups are repeated twice, so that n = 2 and r = 4, and the groups are
separated in the field. It might be wise, if there is a marked difference in variability
in different parts of the field, to randomize the blocks over the whole field instead of
keeping them together as complete replications, but in general this would seem to be
unnecessary, and it is a decided convenience from the standpoint of making
observations on the plots to have all the plots in one replication together.
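
The construction of the sets and the randomization can be sketched in a few lines. The code below is only an illustration under stated assumptions: blocks are kept together by replication, the order of blocks within a replication and of varieties within a block is shuffled with Python's `random` module, and the function names and the `seed` parameter are hypothetical.

```python
import random


def two_group_sets(p: int) -> dict:
    """Group X takes its sets from the rows of the square, Group Y from the columns."""
    digits = range(1, p + 1)
    rows = [[f"{u}{v}" for v in digits] for u in digits]
    cols = [[f"{u}{v}" for u in digits] for v in digits]
    return {"X": rows, "Y": cols}


def field_plan(p: int, n: int, seed=None) -> list:
    """Return one (group, replication, set number, varieties) tuple per block."""
    rng = random.Random(seed)
    plan = []
    for group, sets in two_group_sets(p).items():
        for rep in range(1, n + 1):
            order = list(range(p))
            rng.shuffle(order)              # order of the blocks in this replication
            for s in order:
                block = list(sets[s])
                rng.shuffle(block)          # varieties randomized within the block
                plan.append((group, rep, s + 1, block))
    return plan


for block in field_plan(5, 2, seed=1)[:3]:
    print(block)
```

Every variety appears once in each replication, so with p = 5 and n = 2 the plan holds 20 blocks and each variety occurs 4 times.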

After obtaining the block totals and the grand total the next step is to set up
Table 50, the construction of which should present no difficulty. Note that the
marginal totals $X_{u.}$ and $Y_{.v}$ are those in which variety and block effects are
confounded.

By the shortest method the sum of squares for varieties is calculated as follows:


$\sum T_{uv}^2 / 2n$                              =  1,961,637.50
$\sum (X_{u.} - Y_{u.})^2 / 2np$                  =     81,162.50
$\sum (X_{.v} - Y_{.v})^2 / 2np$                  =    117,817.50
$-(X_{..} - Y_{..})^2 / 2np^2$                    =    -51,076.00
$-[\sum X_{u.}^2 + \sum Y_{.v}^2] / np$           = -2,058,800.00   (Groups + Sets + Mean)

Total = Varieties (SS)                            =     50,741.50


The total sum of squares for all plots is 630,266.00 and for blocks is 467,586.00.
Having obtained these, we can set up the analysis of variance.
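
The five terms of the direct method can be reproduced from the group totals. This is a minimal sketch that assumes the Table 50 values entered as nested lists (rows u, columns v) and treats the row totals of Group X and the column totals of Group Y as the set totals; names are illustrative.

```python
# Group totals X_uv and Y_uv from Table 50 (rows u = 1..5, columns v = 1..5).
X = [[410, 315, 255, 190, 295],
     [110,  95, 240, 155, 200],
     [175, 305, 365, 255, 260],
     [220, 185, 180, 255, 120],
     [160, 185, 305, 340, 155]]
Y = [[355, 360, 335, 215, 380],
     [590, 305, 200, 320, 415],
     [445, 335, 170, 215, 260],
     [510, 400, 265, 150, 270],
     [475, 240, 230, 265, 285]]


def variety_ss_terms(X, Y, n):
    """Return the five terms of the direct formula for the variety sum of squares."""
    p = len(X)
    T = [[X[u][v] + Y[u][v] for v in range(p)] for u in range(p)]
    X_row, Y_row = [sum(r) for r in X], [sum(r) for r in Y]
    X_col = [sum(X[u][v] for u in range(p)) for v in range(p)]
    Y_col = [sum(Y[u][v] for u in range(p)) for v in range(p)]
    return [
        sum(t * t for row in T for t in row) / (2 * n),
        sum((a - b) ** 2 for a, b in zip(X_row, Y_row)) / (2 * n * p),
        sum((a - b) ** 2 for a, b in zip(X_col, Y_col)) / (2 * n * p),
        -(sum(X_row) - sum(Y_row)) ** 2 / (2 * n * p * p),
        # Set totals: rows of Group X and columns of Group Y.
        -(sum(a * a for a in X_row) + sum(b * b for b in Y_col)) / (n * p),
    ]


terms = variety_ss_terms(X, Y, n=2)
print(terms, sum(terms))
```

Summing the terms gives 50,741.50, the variety sum of squares entered in the analysis of variance.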


TABLE 48

Analysis of Variance

Two-Dimensional Quasi-Factorial, Two Groups of Sets


             SS            DF    MS          F       5% Point
Blocks       467,586.00    19    24,609.8    12.3    1.78
Varieties     50,741.50    24     2,114.2     1.06   1.72
Error        111,938.50    56     1,998.9

Total        630,266.00    99
In order to obtain the corrected variety yields we calculate

\[ C_{u.} = \frac{Y_{u.} - X_{u.}}{2np}, \qquad C_{.v} = \frac{X_{.v} - Y_{.v}}{2np}, \qquad u, v = 1, 2, 3, 4, 5 \]

These are entered in the margins of a (5 × 5) table as in Table 51 and added to the
actual means of corresponding cells in the table.

To obtain a further check on the sum of squares for varieties we can now calculate
it in another way using the formula

\[ \text{Varieties (SS)} = \sum (t_{uv} T_{uv}) - \sum (X_{u.} \bar{t}_{u.}) - \sum (Y_{.v} \bar{t}_{.v}) \]

where $\bar{t}_{1.}$, for example, is the mean of all the $t_{uv}$ values in the first row of Table 51
and $\bar{t}_{.1}$ is the mean of the first column.

To make comparisons between the corrected means we may, if we wish to be exact,
take into consideration whether or not the varieties being compared occur in the same
set. To compare varieties 21 and 22, for example, we calculate the variance according
to the formula for 2 varieties in the same set:

\[ SE(t_{21} - t_{22}) = \sqrt{1199.3} = 34.63 \]

\[ t = \frac{161.50 - 123.75}{34.63} = 1.09 \;* \]

To compare varieties 11 and 54 we would have

\[ SE(t_{11} - t_{54}) = \sqrt{1399.23} = 37.41 \]

\[ t = \frac{135.25 - 170.25}{37.41} = -0.94 \]

We would obviously not be very far wrong, even with a p value as low as 5, to use
for all comparisons the mean variance for the difference between 2 varieties. This
would be

\[ V_m = \frac{s^2}{n}\left(\frac{p+3}{p+1}\right) = \frac{1998.9}{2}\left(\frac{8}{6}\right) = 1332.6 \]

\[ SE_m = \sqrt{1332.6} = 36.50 \]

* The t used here is, of course, the statistic defined by R. A. Fisher in "Statistical
Methods for Research Workers."
TABLE 49

Position of Varieties in the Field and Corresponding Plot Yields.
Two-Dimensional Quasi-Factorial Experiment
with Two Groups of Sets

(Each block lists variety number and yield for its five plots in field order.)

Set No.   Var. Yield   Var. Yield   Var. Yield   Var. Yield   Var. Yield   Block Total
1y        31   215     21   300     51   255     41   185     11   145     1,100
2y        22   150     12    50     52    45     32   105     42   155       505
5y        55   125     35    30     15    65     25   130     45    55       405
4y        14    85     34    55     54   110     24   130     44    40       420
3y        53    45     43    45     13    60     23    15     33    -5       160
1y        11   210     21   290     41   325     31   230     51   220     1,275
2y        12   310     32   230     22   155     52   195     42   245     1,135
5y        15   315     45   215     55   160     25   285     35   230     1,205
3y        53   185     43   220     33   175     13   275     23   185     1,040
4y        14   130     24   190     34   160     44   110     54   155       745
1x        14   140     15   165     11   265     13   150     12   180       900
4x        41   190     42   135     45   100     43   145     44   205       775
3x        33   250     31   150     35   150     34   195     32   155       900
2x        22    75     21   105     25   130     23   180     24    90       580
5x        55    40     54   155     53    65     52    60     51    40       360
5x        55   115     54   185     53   240     51   120     52   125       785
1x        11   145     13   105     14    50     15   130     12   135       565
3x        32   150     33   115     34    60     35   110     31    25       460
2x        21     5     24    65     25    70     23    60     22    20       220
4x        41    30     42    50     43    35     45    20     44    50       185

Grand Total = 13,720

TABLE 50

Yields of Varieties by Groups, and Total Yields for Both Groups

Values of X_uv (Group X)

u \ v      1       2       3       4       5      X_u.
1         410     315     255     190     295    1,465
2         110      95     240     155     200      800
3         175     305     365     255     260    1,360
4         220     185     180     255     120      960
5         160     185     305     340     155    1,145
X_.v    1,075   1,085   1,345   1,195   1,030    5,730 = X..

Values of Y_uv (Group Y)

u \ v      1       2       3       4       5      Y_u.
1         355     360     335     215     380    1,645
2         590     305     200     320     415    1,830
3         445     335     170     215     260    1,425
4         510     400     265     150     270    1,595
5         475     240     230     265     285    1,495
Y_.v    2,375   1,640   1,200   1,165   1,610    7,990 = Y..

Values of T_uv

u \ v      1       2       3       4       5      T_u.
1         765     675     590     405     675    3,110
2         700     400     440     475     615    2,630
3         620     640     535     470     520    2,785
4         730     585     445     405     390    2,555
5         635     425     535     605     440    2,640
T_.v    3,450   2,725   2,545   2,360   2,640   13,720 = T..

Differences of marginal totals

v    X_.v - Y_.v        u    Y_u. - X_u.
1      -1,300           1        180
2        -555           2      1,030
3         145           3         65
4          30           4        635
5        -580           5        350

(Y.. - X..) = 2,260


TABLE 51

Calculation of Corrected Variety Means (t_uv)

u \ v       1         2         3         4         5        C_u.
1         135.25    150.00    163.75    111.75    148.75     9.00
2         161.50    123.75    168.75    171.75    176.25    51.50
3          93.25    135.50    144.25    122.25    104.25     3.25
4         149.25    150.25    150.25    134.50    100.25    31.75
5         111.25     96.00    158.50    170.25     98.50    17.50
C_.v      -65.00    -27.75      7.25      1.50    -29.00     0

C_.1 = -1300/20 = -65.00

C_1. = 180/20 = 9.00




17. Two-Dimensional Quasi-Factorials with Three Groups of Sets.
A possible criticism of the quasi-factorial method with two groups of
sets as described above is that there is too great a discrepancy between
the estimates of the error variance for comparing varieties in the same
and in different sets. This can be partly overcome by increasing the
number of groups, and hence the type with three groups of sets is
theoretically an improvement over the previous type. It requires, however,
more computation, and the number of replications must be a multiple
of 3. Details for setting up and analyzing such experiments may be
found in the reference of Yates (20).

18. Three-Dimensional Quasi-Factorials with Three Groups of Sets.
In the two-dimensional types the varieties were represented by two-figure
numbers corresponding to the two dimensions of a square. In
the three-dimensional types the varieties are represented by three-figure
numbers (uvw) corresponding to the three dimensions of a cube. Thus
in a cube with p numbers on a side we can represent p³ varieties, and
taking these numbers in sets of p by slicing in three directions we can
make up 3p² sets. There will be three groups of p² sets, each one
corresponding to a direction in which the cube is sliced. At this point
the student should draw up a cube, put in the numbers, and practice
writing out the sets. It will then be noted that the sets can be written
out directly for any value of p by expanding the sets given below for
p = 3.
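
Slicing the cube is easy to mechanize. The sketch below is an illustration only: the function name is hypothetical, each set is obtained by letting one figure run from 1 to p while the other two are held fixed, and the ordering of the sets within each group is an assumption chosen to match the listing for p = 3 in Example 38.

```python
from itertools import product


def cube_sets(p: int) -> dict:
    """Three groups of p*p sets obtained by slicing a p x p x p cube of variety numbers."""
    d = range(1, p + 1)
    return {
        # Group X: first figure (u) varies, v and w fixed.
        "X": [[f"{u}{v}{w}" for u in d] for v, w in product(d, d)],
        # Group Y: second figure (v) varies, u and w fixed (u changing fastest).
        "Y": [[f"{u}{v}{w}" for v in d] for w, u in product(d, d)],
        # Group Z: third figure (w) varies, u and v fixed.
        "Z": [[f"{u}{v}{w}" for w in d] for u, v in product(d, d)],
    }


groups = cube_sets(3)
for name, sets in groups.items():
    print(name, sets[:2])
```

Each group contains every one of the p³ varieties exactly once, so one repetition of a group is a complete replication.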

When the number of varieties is very large, say 216 or more, there
are decided advantages in using this type of experiment, as with any
other type the blocks would still be rather large.

The details of setting up and analyzing a three-dimensional
experiment may be obtained from Example 38.


Example 38. Three-Dimensional Quasi-Factorial Experiment with Three Groups
of Sets. The specifications are:


Varieties (v) = p³                   = 27
Sets (s) = 3p²                       = 27
Replications of each group (n)       = 2
Complete replications (r) = 3n       = 6
Total number of blocks (b) = 3np²    = 54
Total number of plots (N) = 3np³     = 162

After forming the (3 × 3 × 3) cube we can write out the sets as follows:


Set No.    Group X (.vw)    Group Y (u.w)    Group Z (uv.)
1          111 211 311      111 121 131      111 112 113
2          112 212 312      211 221 231      121 122 123
3          113 213 313      311 321 331      131 132 133
4          121 221 321      112 122 132      211 212 213
5          122 222 322      212 222 232      221 222 223
6          123 223 323      312 322 332      231 232 233
7          131 231 331      113 123 133      311 312 313
8          132 232 332      213 223 233      321 322 323
9          133 233 333      313 323 333      331 332 333
After the distribution of the blocks over the field and the randomization of the
varieties within the blocks we have such an arrangement as is shown in Table 53,
in which the individual plot yields corresponding to the varieties are given. In this
case the blocks are distributed at random over the whole field, but it would have been
more convenient to keep them together in complete replications.

The calculations are carried out in tabular form in Table 54. The data are first
collected by groups so that the yield of any one variety in one group will be a total
of n plots. The marginal totals are obtained as indicated in three directions, and it
will be noted that $X_{.vw}$, $Y_{u.w}$, and $Z_{uv.}$ represent the totals for the sets. The complete
variety totals represented by $T_{uvw}$ are entered next and all the marginal totals of these
obtained.

For calculating the corrected variety means ($t_{uvw}$) the most convenient formula is

\[ t_{uvw} = \frac{T_{uvw}}{3n} + C_{.vw} + C_{u.w} + C_{uv.} \]

where

\[ C_{.vw} = \frac{1}{6np^2}\,(pT_{.vw} - 3pX_{.vw} - T_{.v.} + 3Y_{.v.}) \]

\[ C_{u.w} = \frac{1}{6np^2}\,(pT_{u.w} - 3pY_{u.w} - T_{..w} + 3Z_{..w}) \]

\[ C_{uv.} = \frac{1}{6np^2}\,(pT_{uv.} - 3pZ_{uv.} - T_{u..} + 3X_{u..}) \]

Thus

\[ C_{.11} = \tfrac{1}{108}\,(3 \times 2735 - 9 \times 340 - 9875 + 3 \times 3635) = 57.176 \]

\[ C_{1.1} = \tfrac{1}{108}\,(3 \times 3330 - 9 \times 1385 - 9645 + 3 \times 3105) = -25.972 \]

\[ C_{11.} = \tfrac{1}{108}\,(3 \times 3305 - 9 \times 1185 - 9470 + 3 \times 3180) = -6.296 \]

Having obtained all the correction terms, we check by obtaining the total, which in
this case comes to +0.001. This is a sufficiently close check.

The corrected means are obtained by adding the corresponding correction terms
to the actual means. For example, $t_{111} = 151.667 + 57.176 - 25.972 - 6.296 = 176.575$.

To obtain the sum of squares for varieties we first average the corrected means in
three directions to give $\bar{t}_{.vw}$, $\bar{t}_{u.w}$, and $\bar{t}_{uv.}$. To illustrate this:

\[ \bar{t}_{.11} = \tfrac{1}{3}(176.575 + 190.001 + 164.723) = 177.100 \]

\[ \bar{t}_{1.1} = \tfrac{1}{3}(176.575 + 192.222 + 224.028) = 197.608 \]

\[ \bar{t}_{11.} = \tfrac{1}{3}(176.575 + 180.556 + 197.917) = 185.016 \]

The sum of squares for varieties is then given by

\[ \text{Varieties (SS)} = \sum (t_{uvw} T_{uvw}) - \sum (X_{.vw} \bar{t}_{.vw}) - \sum (Y_{u.w} \bar{t}_{u.w}) - \sum (Z_{uv.} \bar{t}_{uv.}) \]

which in this case is

5,847,432.06 - 5,754,971.44 = 92,460.62
Then after calculating the total and block sums of squares from Table 53, we can set
up the analysis of variance.


TABLE 52

Analysis of Variance

Three-Dimensional Quasi-Factorial Experiment
with Three Groups of Sets


             SS           DF     MS      F      5% Point
Blocks       1,154,025     53
Varieties       92,461     26    3556    1.23   1.62
Error          236,872     82    2889

Total        1,483,358    161





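The variances and standard errors for comparing the varieties are as follows.
It will be noted that such comparisons now fall into three groups that can be
determined from the variety numbers, according to whether the 2 varieties differ in
one, two, or all three figures.

Varieties differing in one figure:

\[ V = \frac{2s^2}{3np^2}(p^2 + p + 1) = \frac{2889}{27} \times 13 = 1391, \qquad SE = \sqrt{1391} = 37.30 \]

Varieties differing in two figures:

\[ V = \frac{s^2}{3np^2}(2p^2 + 3p + 4) = \frac{2889}{54} \times 31 = 1658, \qquad SE = \sqrt{1658} = 40.72 \]

Varieties differing in all three figures:

\[ V = \frac{s^2}{3np^2}(2p^2 + 3p + 6) = \frac{2889}{54} \times 33 = 1766, \qquad SE = \sqrt{1766} = 42.02 \]

And the mean variance of all comparisons is

\[ V_m = 1630, \qquad SE = \sqrt{1630} = 40.37 \]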

19. Symmetrical Incomplete Block Experiments. It will be remembered
from the discussion of Section 15 that, if all the possible groups of
degrees of freedom are not confounded, certain of the comparisons are
determined with less precision than others. For this reason in using
the quasi-factorials we have two or more standard errors depending on
the "dimensions" of the experiment. This difficulty can be overcome
by confounding all the possible groups of degrees of freedom or, in other
words, by using all the possible groups of sets. We then have a design
that is perfectly symmetrical, and not only do we have equal precision
for all comparisons but also the calculations are considerably simplified.

The chief problem in setting up the design of a symmetrical
experiment is in writing out the sets. For this purpose we can conveniently
TABLE 53


Position of Varieties in the Field and Corresponding Plot Yields.
Three-Dimensional Quasi-Factorial Experiment with Three Groups of Sets


B§t 

No. 

Vui- 

ety 

Yield 

Vvi- 

•ty 

Yield 

Vm- 

•ty 

Yield 



Yen- 

•iy 

Yield 

Vui- 

rty 

Yield 

Vari- 

ety 

rield 

Block 

Totob 

2i 

212 

315 

IQ 

370 



IffW 

5 

122 

105 

112 

310 



820 

By 

222 

285 


355 


345 

005 

cl 

233 

215 

231 

330 



815 

By 

322 

245 


185 


100 

500 

•y 

333 

KM 

313 

■3 


140 

525 

By 

223 

285 


355 

213 

240 

880 

7k 

231 


131 

B] 

831 

235 

075 

By 

211 

325 


315 

231 

300 

040 

By 

312 

k3 

111 


332 


035 

h 

122 

240 

Til 

220 

222 

350 

810 


•121 

255 

321 

El 

221 

230 

720 

2k 

212 

300 

312 

230 

112 

mTTm 

815 



275 

222 

245 

212 

140 

000 

Sy 

331 

270 

311 

255 

321 





270 

222 

230 

221 

135 

035 

Ak 

323 

175 

123 

200 

223 





05 

221 

245 

223 

330 

070 

0x 

323 

180 

123 

275 

223 

200 



332 

215 

333 

300 

331 

Kl’B 

770 

3y 

321 

155 

831 

180 

311 

100 



213 

185 

313 

145 

113 

150 

480 

»y 

323 

120 

313 

70 

333 

100 

200 

tI 

333 

mm 

133 

45 

233 

105 

200 

7y 

113 

100 

123 

170 

133 

05 

335 


322 

155 

323 

125 

321 

30 

310 

Ok 

233 

55 

333 

145 

133 

40 



131 

65 

331 

130 

231 

55 

250 

U 

111 

35 

311 

45 

211 

55 


2. 

122 

85 

123 

55 

121 

no 

250 

7y 

123 

140 

133 

45 

113 

15 

Vgji 

la 

331 

130 

332 

40 

333 

45 

210 

»u 

in 

85 

211 

65 

311 

55 


il 

131 

45 

132 

00 

133 

15 

120 

li 

112 

80 

111 

115 

113 

105 


19 

313 

0 

213 

70 

113 

05 

135 

ly 

121 

180 

111 

255 

131 

200 

725 

8y 

223 

285 

213 

270 

233 

185 

740 

te 

222 

150 

122 

55 

322 

50 

255 

ly 

111 

210 

131 

205 

121 

185 

000 

8k 

332 

130 

132 

215 

232 

155 

500 

2y 

211 

05 

221 

05 

231 

165 

345 

4k 

121 

210 

221 

00 

321 

05 

305 

m 

213 

100 

212 

140 

211 

125 

■SI 

7k 

311 

140 

312 

105 

313 

310 

045 

U 

111 

210 

112 

209 

113 

326 

ISI 

4y 

132 

230 

122 

220 

112 

310 

700 

4i 

211 

KM 

213 

155 

212 

105 

B3 

3i 

132 

245 

133 

315 

131 

215 

775 

8k 

132 

100 

232 

285 

332 

230 

075 

6i 

232 

185 

233 

220 

231 

175 

580 

7i 

311 

275 

313 

185 

312 

130 

500 

2i 

121 

100 

122 

160 

123 

no 

400 

Bk 

323 

155 

321 

150 

322 

240 

545 


divide such experiments into two types: (1) where the number of varieties
v = p²; and (2) where v = p² - p + 1. There are, of course,
other types, but the two mentioned are likely to be of the most value in
field experiments. Considering the first type (v = p²), it is obvious
that the variety numbers can be written in the form of a square.
Suppose that we have 9 varieties; then the square is


11 12 13
21 22 23
31 32 33


The first two groups of sets are written as for a two-dimensional
quasi-factorial, from the rows and columns of the square. Two more groups
may then be written from the diagonals of the above square. These are


11 22 33        11 32 23
21 32 13        21 12 33
31 12 23        31 22 13
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1,385 1,285 1.190 3,860 1,580 1,625 1,525 4,730 ' 535 1,620 815 2,070 | 3,500 4,530 3,530 
the second one being written from the diagonals of the first. This must
be all the groups, as we know from a study of the degrees of freedom in
a Latin square, and also from the fact that, if we repeat the process on
the last square written, the original square is regenerated. The maximum
number of groups that can be written is always p + 1. On examining
these sets we note that each variety occurs once and once only in
the same set with any other variety. Taking variety 11, the sets in
which it occurs are

(11 12 13), (11 21 31), (11 22 33), (11 32 23),

and in these four sets all the other varieties have occurred once.
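
The diagonal construction and the "once and once only" property can be verified by a short program. This sketch assumes p is prime and generates the diagonal groups by modular arithmetic on the row and column positions, which yields the same sets as repeatedly taking diagonals; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def balanced_groups(p: int) -> list:
    """All p + 1 groups of sets for a p x p square of variety numbers (p prime)."""
    idx = range(p)
    name = lambda u, v: f"{u + 1}{v + 1}"
    groups = [
        [[name(u, v) for v in idx] for u in idx],   # rows
        [[name(u, v) for u in idx] for v in idx],   # columns
    ]
    # Diagonals of slope k: row u meets column (c + k*u) mod p.
    for k in range(1, p):
        groups.append([[name(u, (c + k * u) % p) for u in idx] for c in idx])
    return groups


def pair_counts(groups) -> Counter:
    """Count how often each pair of varieties falls in the same set."""
    counts = Counter()
    for group in groups:
        for s in group:
            counts.update(frozenset(pair) for pair in combinations(s, 2))
    return counts


groups = balanced_groups(3)
print([s for g in groups for s in g if "11" in s])
print(set(pair_counts(groups).values()))
```

For p = 3 the four sets containing variety 11 are those listed above, and every one of the 36 possible pairs of varieties is counted exactly once.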

If p is a prime number the above method of writing out the sets will
work for the type (v = p²). If p is not a prime number we must make
use of a completely orthogonalized square, if such a square can be
prepared. For p = 6 the orthogonalized square is impossible, so that we
cannot write more than three groups of sets. This is the same as saying
that a Latin square is possible for any number of rows and columns,
but Graeco-Latin squares are impossible for certain numbers, Fisher (2).
A completely orthogonalized 4 × 4 square is given below, and further
squares are given in R. A. Fisher's "Design of Experiments," 1937.

Completely Orthogonalized 4 × 4 Square


111 234 342 423
222 143 431 314
333 412 124 241
444 321 213 132

This square may be used to show how the sets for 16 varieties can be
made up.

The first two groups of sets are obtained from the rows and columns
of the square of variety numbers in the usual way, and the orthogonalized
square is used to write out the remaining groups. Assuming that the
square of variety numbers is as follows:


11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44


and is superimposed on the orthogonalized square, we note, considering
the first of the three-digit numbers only, that 1 corresponds with the
variety numbers 11, 22, 33, 44; 2 with the numbers 21, 12, 43, 34; 3
with 31, 42, 13, 24; and 4 with 41, 32, 23, 14. These are the sets for the
third group, and we make up two more groups by using the second and
third figures of the orthogonalized square.
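
The superimposition can be sketched as a lookup on each figure of the orthogonalized square. The code assumes the 4 × 4 square is entered row by row as printed and that varieties are numbered by row and column; the function name is hypothetical.

```python
# Completely orthogonalized 4 x 4 square, entered row by row.
SQUARE = [
    ["111", "234", "342", "423"],
    ["222", "143", "431", "314"],
    ["333", "412", "124", "241"],
    ["444", "321", "213", "132"],
]


def groups_from_square(square) -> list:
    """One group of sets per figure: varieties sharing that figure form a set."""
    p = len(square)
    groups = []
    for figure in range(len(square[0][0])):
        sets = {}
        for u in range(p):
            for v in range(p):
                symbol = square[u][v][figure]
                sets.setdefault(symbol, []).append(f"{u + 1}{v + 1}")
        groups.append([sets[s] for s in sorted(sets)])
    return groups


for group in groups_from_square(SQUARE):
    print(group)
```

Together with the groups from the rows and columns this gives the five groups of sets for 16 varieties.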
To write the sets for the type v = p² - p + 1, it is only necessary
to modify the above procedure. Suppose that v = 13; then p = 4 and
p - 1 = 3. A convenient method of designating the varieties is as
follows:


01 02 03 04
11 12 13
21 22 23
31 32 33,


and if the sets are written for the 9 numbers in the square, the sets for the
13 varieties are obtained by making one set out of 01, 02, 03, 04, and the
remaining sets by adding one of these to the sets of each group formed
by the other 9 numbers. The sets finally are as follows:

01 02 03 04

01 11 12 13    02 11 21 31    03 11 22 33    04 11 32 23

01 21 22 23    02 12 22 32    03 21 32 13    04 21 12 33

01 31 32 33    02 13 23 33    03 31 12 23    04 31 22 13
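
The modification can be written as a small extension of the square construction. The following sketch assumes the side of the square, here called q = p - 1, is prime, and builds the groups for the square by modular arithmetic; the names `square_groups`, `extended_sets`, and `q` are illustrative and not taken from the text.

```python
from collections import Counter
from itertools import combinations


def square_groups(q: int) -> list:
    """All q + 1 groups of sets for a q x q square of variety numbers (q prime)."""
    idx = range(q)
    name = lambda u, v: f"{u + 1}{v + 1}"
    groups = [[[name(u, v) for v in idx] for u in idx],     # rows
              [[name(u, v) for u in idx] for v in idx]]     # columns
    for k in range(1, q):                                    # diagonals
        groups.append([[name(u, (c + k * u) % q) for u in idx] for c in idx])
    return groups


def extended_sets(q: int) -> list:
    """Sets for q*q + q + 1 varieties: one extra variety is added to each group."""
    extras = [f"0{i + 1}" for i in range(q + 1)]
    sets = [extras]                                          # the set 01, 02, ...
    for extra, group in zip(extras, square_groups(q)):
        sets += [[extra] + s for s in group]
    return sets


sets = extended_sets(3)
for s in sets:
    print(" ".join(s))
```

For q = 3 this prints 13 sets of 4 varieties, and any 2 of the 13 varieties occur together in exactly one set.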


If the number of varieties is 21, the numbers would be written out as
below:

01 02 03 04 05
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44


and we would have to use a completely orthogonalized 4 × 4 square in
order to make up the 20 sets for the 16 numbers in the square, to which
the remaining numbers would be added as described above.

Special mention should be made of the fact that, as the sets are
written out by the methods described above for the v = p² - p + 1
type, the blocks cannot be arranged so that they form complete
replications. There is a method of making up the sets (Youden's square) by
means of which all the blocks are placed side by side and all the plots in a
single row from one end of the field to the other would form a complete
replication. This method is likely to be of considerable value in
laboratory experiments, but in field plot experiments it is not likely that the
long narrow strips one plot wide would be of any value in error control.

Example 39. A Symmetrical Incomplete Block Experiment for 25 Varieties and
6 Replications. The sets have been written out by the method described above, and
those for each group have been kept together to form complete replications. This
will be obvious from Table 55, and it will be noted also that no attempt has been
made to randomize the blocks. All the randomization is of the varieties within
blocks. It is convenient to enter on the plan of the field the individual yields and
the block totals. The variety totals are obtained by collecting the individual yields
as in Table 56. These are denoted by $T_{uv}$. The figures in the column headed $\Sigma_{uv}$
are obtained by adding for any one variety the totals for all the blocks in which that
variety occurs. Thus from Table 55 for variety 11 we have

\[ \Sigma_{11} = 257 + 181 + 177 + 265 + 271 + 303 = 1454 \]

The second last column is obtained as indicated, and this can be checked by adding,
as the total for all the $(pT_{uv} - \Sigma_{uv})$ values is zero. The last column gives the
corrected variety means ($t_{uv}$), which are given by the formula

\[ t_{uv} = \frac{pT_{uv} - \Sigma_{uv}}{v} + m \]

where m is the general mean of the whole experiment and v is the number of varieties.
The sum of squares for varieties is given simply by

\[ \text{Varieties (SS)} = \frac{\sum (pT_{uv} - \Sigma_{uv})^2}{pv} \]

The analysis of variance can then be set up as at the foot of Table 56. The
method is also given for calculating the variance of a difference between two
corrected means. The general formula is

\[ V_m = \frac{2s^2}{r}\left(\frac{p+1}{p}\right) \]

where r is the number of replications.
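
The correction for a single variety can be sketched as follows. The example uses the figures quoted for variety 11 (variety total 251, the six block totals listed above, grand total 5294 over 150 plots, error mean square 88.21). The variance function assumes the usual balanced-design factor (p + 1)/p, which is an assumption of this sketch, and the function names are illustrative.

```python
def corrected_mean(T_uv: float, block_totals: list, p: int, v: int, m: float) -> float:
    """Corrected variety mean: (p*T_uv - sum of block totals) / v + m."""
    sigma_uv = sum(block_totals)       # total of all blocks containing the variety
    return (p * T_uv - sigma_uv) / v + m


def variance_of_difference(s2: float, r: int, p: int) -> float:
    """Variance of a difference between two corrected means (assumed factor (p+1)/p)."""
    return (2 * s2 / r) * (p + 1) / p


m = 5294 / 150                         # general mean of the experiment
t_11 = corrected_mean(251, [257, 181, 177, 265, 271, 303], p=5, v=25, m=m)
print(round(t_11, 2), round(variance_of_difference(88.21, r=6, p=5), 2))
```

Because every pair of varieties occurs together in a block equally often, this single variance serves for all comparisons.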
TABLE 55

Location of the Varieties in the Field and Corresponding Yields.
Symmetrical Incomplete Block Experiment for 25 Varieties and 6 Replications

Replicate VI 


IMot No. 

1 2 3 4 5 

6 7 8 9 10 

11 12 13 14 15 

16 17 18 10 20 

21 22 23 24 25 

Variety 

52 34 25 11 43 

12 44 35 21 53 

31 13 54 22 45 

41 14 32 23 55 

33 15 51 42 24 

Yields 

57 52 38 60 50 

31 31 28 32 24 

24 40 10 20 30 

40 35 32 10 36 

36 44 40 57 68 

Block 






totals 

257 

146 

133 

162 

251 


Replicate V 


Replicate total » 010 


Plot No. 

1 2 3 4 5 

6 7 8 0 10 

11 12 13 14 15 

16 17 18 10 20 

21 22 23 24 25 

Variety 

.35 23 42 54 11 

.33 52 14 21 45 

55 12 43 24 31 

41 53 15 34 22 

25 44 33 13 51 

Yields 

54 30 28 40 20 

14 11 10 24 10 

30 42 32 28 30 

32 38 26 16 20 

10 24 8 12 26 

Block 






totals 

181 

78 

162 

132 

80 


Replicate IV 


Replicate total — 642 


Plot No 

1 

2 

.3 

4 

5 

G 

7 

8 

0 

in 

11 

12 

13 

14 

15 

16 

17 

IS 

10 

20 

21 

22 

23 

24 

25 

Vanety 

32 

24 

45 

5J 

11 

34 

1.1 

21 

42 

55 

52 

23 

15 

31 

44 

54 

33 

12 

41 

25 

35 

22 

51 

4.3 

14 

Yields 

57 

39 

25 

32 

24 

7 

23 

18 

24 

24 

30 

42 

16 

16 

23 

23 

30 

35 

18 

21 

20 

?3 

15 

16 

27 

Block 


























tuUls 



177 





06 





127 





138 





101 




Replicate III 


Replu ate total 630 


Plot No 

1 

2 

3 

4 

5 

6 

1 

8 

0 

10 

11 

12 

1,3 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 


25 

Variety 

33 

44 

22 

55 

11 

15 

32 

43 

51 

21 

11 

5.3 

.>1 

25 

42 

13 

35 

24 

52 

41 

12 

23 

34 

51 

45 

Yields 

Block 

74 

57 

34 

49 

61 

45 

43 

31 

44 

40 

41 

36 

28 

8 

lfi 

19 

30 

23 

22 

35 

25 

31 

10 

23 

27 

totals 



265 





203 





120 





129 





125 




Replicritr 11 


Replicate total « 861 


Plot No. 
Vancty 
Yielda 
Block 
totala 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

1.3 

14 

1.5 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

11 

31 

51 

21 

41 

22 

12 

42 

5J 

32 

13 

51 

43 

.33 

2.1 

14 

44 

54 

24 

34 

45 

55 

35 

16 

25 

52 

57 

40 

79 

43 

36 

33 

24 

44 

,32 

2 * 

27 

11 

18 

32 

37 

29 

24 

37 

32 

22 

10 

21 

20 

20 



271 





160 





110 





150 





117 




Repheate I 


Repin ate total « 820 


Plot No 

1 2 3 4 5 

6 7 8 0 10 

11 12 13 14 15 

16 17 18 19 20 

21 22 23 24 25 


12 14 15 13 11 

23 22 21 21 25 

34 35 32 31 33 

43 44 41 45 42 

53 51 52 55 54 


74 65 54 66 44 

48 57 37 44 65 

33 35 37 38 30 

46 72 57 62 89 

76 54 65 75 84 








303 

241 

173 

320 

344 


Replicate total « 1387 
Qnuid total -5294 
TABLE 56

Yields of Single Plots by Varieties, Variety Totals, Values of Σ_uv, and the
Corrected Means t_uv. Symmetrical Incomplete Block Experiment for 25
Varieties and 6 Replications


Vari- 

ties 

VI 


IV 

III 

II 




dTo« — 2-,. “ 

s«. 

11 

60 

20 

24 

51 

52 

44 

251 

1.454 

-199 

27 

33 

12 

31 

42 

35 

25 

33 

74 

240 

1,043 

157 

41 

57 

13 

40 

12 

23 

19 

22 

66 

182 

860 

50 

37 

79 

14 

35 

10 

27 

41 

37 

65 

215 

932 

143 

41 

01 

15 

44 

26 

16 

45 

29 

54 1 

214 

1,133 

-63 

32 

77 

21 

32 

24 

18 

40 

79 

37 ! 

230 

1,035 

115 

39 

89 

22 

20 

20 

23 

34 

36 

57 

190 

1,041 

-91 

31 

65 

23 

19 

39 

42 

31 

32 

48 

211 

946 

109 

39 

65 

24 

68 

28 

39 

23 

1 37 

44 

239 

1,119 

76 

38 

33 

25 

38 

19 1 

1 21 

8 

26 

55 

167 

971 

-136 

29 

85 

31 

24 

30 

16 

28 

67 

38 

193 

995 

-30 

34 

09 

32 

32 

8 

57 

43 

32 

37 

209 

973 

72 

38 

17 

33 

36 

14 

39 

74 

18 

30 

1 211 

1,015 

40 

36 

89 

34 

52 

16 

7 

19 

32 

33 

1 159 

942 

-147 

29 

41 

35 

28 

54 

20 

30 

21 

35 

i 188 

847 1 

93 

39 

01 

41 

40 

32 

18 

35 

43 

57 ! 

! 225 

1.158 1 

-33 

33 

97 

•42 

57 

28 

24 

16 

24 

89 1 

1 238 

1,152 

38 

30 

81 

43 

50 

32 

16 

31 

11 

46 

1S6 

1,159 

-229 

26 

13 

44 

31 

24 

23 

57 

29 

72 

236 

1,112 

68 

38 

01 

45 

30 

19 

25 

27 

22 

62 

185 

956 

-31 

34 

05 

51 

46 

11 

15 

23 

40 

54 

189 

1,181 

-236 

25 

85 

52 

57 

26 : 

1 30 

22 

44 

55 

234 

1,104 

66 

37 

93 

53 

24 

3S j 

! 32 

30 

27 

76 

233 

1 1,038 

127 

40 

37 

54 

19 

•40 

1 25 

44 

24 

84 

236 

1,158 

22 

3r> 

17 

55 

30 

30 

1 24 

49 

19 

75 

233 

1 1,146 

19 

36 

05 

3'utiils 1 

949 

042 

1 

639 

1 851 

j 826 

00 

j5294 

j 26,470 

0 

i 



Varieties (SS) = Σ(pT_uv - Σ_uv)²/pv = 324,354/125 = 2594.83

m = 5,294/150 = 35.29


Sums of squares of plot yields by replications

I        83,531
II       32,228
III      34,039
IV       19,029
V        19,568
VI       40,367

Total    = 228,762.00
CT       = 186,842.91
Σ(x - x̄)² =  41,919.09

Σ(B²)/5 = 1,088,496/5 = 217,699.20
CT                    = 186,842.91
Blocks                =  30,856.29


Analysis of Variance

             SS           DF     MS       F      5% Point
Blocks       30,856.29     29
Varieties     2,594.83     24    108.12   1.22   1.63
Error         8,467.97     96     88.21

Total        41,919.09    149


V_m = (2 × 88.21/6) × (6/5) = 35.28

SE_m = 5.94

Example 40. A Symmetrical Incomplete Block Experiment for 31 Varieties in
6 Replications. The sets were written out by setting up the variety numbers as
follows:


01 02 03 04 05 06
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55,


writing out the 6 groups of sets for the 5 × 5 square and adding, to each, one of the
numbers in the first row. An additional set was then made up from the numbers in
the first row, giving 31 sets in all. The blocks were arranged as indicated in Table
58, after randomizing the varieties within the blocks. The variety totals are collected
as in Table 59, and it is convenient for this purpose and for obtaining the values of
Σ_uv to make up a table similar to Table 60 giving the sets with their corresponding
numbers and block totals. Then, to collect the yields of, say, variety 23, we can
locate it in each group, note the numbers of the sets, and then proceed from the
table of individual yields to obtain the total. Similarly, to obtain Σ_23 we add the block
totals in the same line as 23 throughout the table.

From this point the calculations are exactly as in Example 39 for 25 varieties,
except that, since this experiment is of the v = p² - p + 1 type, the variance for
the difference between two corrected variety means is



The analysis of variance is given in Table 57 


TABLE 57

Analysis of Variance

Incomplete Block Experiment for 31 Varieties in 6 Replications


             SS           DF     MS        F       5% Point
Blocks       1,083,491     30    36,116    10.5    1.53
Varieties      103,977     30     3,466     1.01   1.53
Error          429,756    125     3,438

Total        1,617,224    185
TABLE 58


Location of the Varieties in the Field, Corresponding Plot Yields, and
Block Totals. Symmetrical Incomplete Block Experiment with 31 Varieties
and 6 Replications


i 

B 

Yield 

B 

Yield 

Vwi- 

eiy 

Yield 

s 


B 

Yield 

B 

Yield 

Block 

Toiola 

1 

11 

315 

13 


01 

360 

nil 

265 

12 

355 

15 

345 

2,010 

2 

23 

246 

22 


21 

160 

B 1 

285 

24 

355 

25 

240 

1,470 

3 

01 

325 

33 


32 

300 


240 

31 


34 

350 

1,750 

4 

45 

360 

43 

230 

42 

225 

B9 

270 

41 

255 

44 

170 

1.510 

5 

01 

175 

63 

290 

51 

330 

64 

220 

62 


65 

205 

1,600 

6 

31 

105 

11 

310 

21 

315 

02 

215 

41 

330 

51 

270 

1.035 

7 

22 

290 

52 

95 

02 

140 

32 

330 

12 

410 

42 

236 

1,500 

8 

13 

255 

23 

375 

43 

305 

33 

255 

02 

235 

53 

230 

1.660 


64 

275 

44 

245 

34 

140 

24 

270 

14 

230 

02 

135 

1.295 

10 

45 

95 

35 

245 

02 

330 

25 

235 

15 

200 

55 

285 

1,390 

11 

44 

180 

11 

275 

33 

290 

55 

155 

03 

180 

22 

160 

1.240 

12 

03 

120 

32 

70 

21 

100 

15 

100 

43 

170 

54 

65 

625 

13 

63 

65 

42 

145 

31 

40 

25 

35 

03 

45 

14 

55 

375 

14 

24 

140 

13 

45 

35 

15 

03 

85 

62 

65 

41 

55 

405 

15 

45 

80 

23 

115 

34 

165 

03 

85 

51 

65 

12 

120 

620 

16 

32 

215 

• 11 

300 

45 

255 

24 

185 

04 

145 

63 

150 

1,250 

17 

13 

50 

34 

45 

65 

105 

42 

155 

21 

125 

04 

30 

610 

18 

23 

06 

16 

130 

44 

55 

31 

85 

04 

65 

52 

no 

600 

10 

25 

130 

33 

40 

41 

45 

12 

46 

64 

60 

04 

15 

335 

20 

35 

-5 

04 

70 

22 

65 

43 

35 

14 

255 

51 

80 

500 

21 

05 

180 

11 

255 

23 

290 

42 

285 

35 

270 

54 

185 

1,460 

22 

21 

150 

52 

55 

14 

50 

46 

210 

33 

265 

05 

185 

916 

23 

55 

130 

24 

215 

12 

155 

31 

05 

05 

05 

43 

155 

845 

24 

16 

210 

41 

90 

53 

95 

22 

160 

05 

140 

34 

125 

820 

25 

32 

140 

05 

195 

13 

310 

51 

195 

26 

130 

44 

285 

1,256 

26 

11 

^10 

34 

290 

43 

325 

25 

230 

62 

220 

06 

310 

1,585 

27 

12 

230 

44 

155 

35 

195 

53 

245 

06 

315 

21 

215 

1,355 

28 

13 

100 

31 

285 

54 

230 

22 

185 

45 

220 

06 

175 

1,255 

20 

14 

275 

55 

185 

06 

130 

32 

190 

41 

100 

23 

no 

1,050 

30 

15 

155 

42 

150 

24 

240 

06 

130 

33 

145 

51 

125 

945 

31 

01 

220 

05 

215 

06 

195 

03 

240 

02 

205 

04 

230 

1,305 

34,060 
TABLE 59

Yields of Single Plots by Varieties, Variety Totals, Sums of Block Totals, and the
Corrected Means. Symmetrical Incomplete Block Experiment with 31
Varieties and 6 Replications

Variety                                        Variety   Sum of Block   6 × Total −     Corrected
No.       Single Plot Yields                   Total     Totals         Sum of Block    Mean
                                                                        Totals
01        360  285  325  270  175  220         1,635      9,635            175          193.6
02        215  140  235  135  330  295         1,350      8,870           -770          163.2
03        180  120   45   85   85  240           755      4,660           -130          183.8
04        145   30   55   15   70  230           545      4,490          -1220          148.6
05        180  185   95  140  195  215         1,010      6,695           -635          167.5
06        310  315  175  130  130  195         1,255      7,585            -55          186.2
11        315  310  275  300  255  210         1,665      9,185            805          214.0
12        355  410  120   45  155  230         1,315      6,665           1225          227.5
13        370  255   45   50  310  160         1,190      7,090             50          189.6
14        265  230   55  255   50  275         1,130      6,145            635          208.5
15        345  200  100  130  210  155         1,140      6,290            550          205.7
21        160  315  100  125  150  215         1,065      6,510           -120          184.1
22        185  290  160   65  160  185         1,045      6,785           -515          171.4
23        245  375  115   65  290  110         1,200      6,760            440          202.2
24        355  270  140  185  215  240         1,405      6,210           2220          259.6
25        240  235   35  130  130  230         1,000      6,410           -410          174.8
31        220  195   40   85   95  285           920      6,360           -840          160.9
32        300  330   70  215  140  190         1,245      7,430             40          189.3
33        315  255  290   40  265  145         1,310      6,840           1020          220.9
34        350  140  165   45  125  290         1,115      6,580            110          191.5
35        240  245   15   -5  270  195           960      6,865          -1105          152.4
41        255  330   55   45   90  160           935      5,755           -145          183.3
42        225  235  145  155  285  150         1,195      6,305            865          215.9
43        230  305  170   35  155  325         1,220      6,720            600          207.4
44        170  245  180   55  285  155         1,090      7,155           -615          168.2
45        360   95   80  255  210  220         1,220      6,940            380          200.3
51        330  270   55   80  195  125         1,055      6,455           -125          184.0
52        220   95   65  110   55  220           765      6,405          -1815          129.4
53        290  230   55  150   95  245         1,065      6,955           -565          169.8
54        220  275   65   60  185  230         1,035      6,475           -265          179.4
55        265  285  155  105  130  185         1,125      6,535            215          194.9

Total    8215 7495 3680 3555 5490 6525        34,960    209,760
TABLE 60

Sets Arranged in Order of Numbers with Corresponding Block Totals.
Incomplete Randomized Block Experiment

Set No.   Varieties in Set           Block Totals
   1      01  11  12  13  14  15        2010
   2      01  21  22  23  24  25        1470
   3      01  31  32  33  34  35        1750
   4      01  41  42  43  44  45        1510
   5      01  51  52  53  54  55        1500

   6      02  11  21  31  41  51        1635
   7      02  12  22  32  42  52        1500
   8      02  13  23  33  43  53        1655
   9      02  14  24  34  44  54        1295
  10      02  15  25  35  45  55        1390

  11      03  11  22  33  44  55        1240
  12      03  21  32  43  54  15         625
  13      03  31  42  53  14  25         375
  14      03  41  52  13  24  35         405
  15      03  51  12  23  34  45         620

  16      04  11  32  53  24  45        1250
  17      04  21  42  13  34  55         510
  18      04  31  52  23  44  15         500
  19      04  41  12  33  54  25         335
  20      04  51  22  43  14  35         500

  21      05  11  42  23  54  35        1465
  22      05  21  52  33  14  45         915
  23      05  31  12  43  24  55         845
  24      05  41  22  53  34  15         820
  25      05  51  32  13  44  25        1255

  26      06  11  52  43  34  25        1585
  27      06  21  12  53  44  35        1355
  28      06  31  22  13  54  45        1255
  29      06  41  32  23  14  55        1050
  30      06  51  42  33  24  15         945

  31      06  01  02  03  04  05        1395

Grand Total = 34,960
20. Choosing the Best Type of Incomplete Block Experiment for a
Given Test. After a study of the various incomplete block experiments
it will be noted that each has certain limitations. On account of general
simplicity the symmetrical incomplete blocks are to be preferred to the
quasi-factorials, and in addition all comparisons are made with equal
precision. However, for the symmetrical types we must have, when
$v = p^2$, $p + 1$ replications, and when $v = p^2 - p + 1$, $p$ replications.
For a test of 121 or 133 varieties we require 12 replications, and if the
number of varieties is greater than this it is obvious that in general the
test will be more expensive than is usually warranted in such cases. At
a certain point, therefore, it would seem that the quasi-factorials should
be extremely useful. On account of its relative simplicity the two-
dimensional quasi-factorial with two groups of sets is preferable to the
three-dimensional type, but the latter will probably be the most efficient
if the number of varieties is quite large. These points can now be used
as a basis for setting up a general schedule as to the type of experiment
best suited to a given number of varieties. For this purpose Table 61
has been prepared, taking as a basis the number of varieties that can be
tested by at least one of three types.

In Table 61 the dotted lines indicate the range through which the 
methods are generally recommended. The two-dimensional quasi- 
factorial can be used at the point where the number of replications for 
the symmetrical type becomes too large. For very large numbers the 
three-dimensional quasi-factorial is probably the most efficient, but, 
since it can be applied easily only to numbers that are cubes, the two- 
dimensional type must be extended to include fairly high numbers. 
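The arithmetic behind such a schedule can be sketched as follows. The function below is a hypothetical helper, not part of the text: it only checks whether a given number of varieties v has the form p², p² − p + 1, or p³, and reports the block size p and the replications r that the corresponding type would require. It does not verify that the necessary orthogonalized squares actually exist, so a returned candidate is a possibility to be checked, not a guarantee.

```python
from math import isqrt


def candidate_designs(v):
    """List (type, p, r) candidates for v varieties, by arithmetic only."""
    candidates = []

    # v = p^2: symmetrical blocks with p + 1 replications, or a
    # two-dimensional quasi-factorial with a multiple of 2 replications.
    root = isqrt(v)
    if root >= 2 and root * root == v:
        candidates.append(("symmetrical incomplete blocks", root, str(root + 1)))
        candidates.append(("two-dimensional quasi-factorial", root, "2n"))

    # v = p^2 - p + 1: solve p = (1 + sqrt(4v - 3)) / 2 in whole numbers.
    s = isqrt(4 * v - 3)
    if s * s == 4 * v - 3:
        p = (1 + s) // 2
        if p >= 2:
            candidates.append(("symmetrical incomplete blocks", p, str(p)))

    # v = p^3: three-dimensional quasi-factorial, a multiple of 3 replications.
    cube = round(v ** (1.0 / 3.0))
    if cube >= 2 and cube ** 3 == v:
        candidates.append(("three-dimensional quasi-factorial", cube, "3n"))

    return candidates


for v in (31, 64, 133):
    print(v, candidate_designs(v))
```

For 64 varieties, for example, all three types are arithmetically possible, which is why the choice then rests on the number of replications that can be afforded.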

A possible objection to incomplete block experiments in general may 
be that certain numbers of varieties cannot be tested and hence the 
experimenter may feel that it is still necessary to use randomized blocks. 
However, it would seem to be desirable where possible to suit the number
of varieties to the experiment even if it involves using "dummy"
varieties. Also, for those who wish definitely to use other numbers than
those listed here, Yates (20) has developed methods for laying out and
analyzing quasi-factorials in which the dimensions are not equal. Thus 
instead of a 12 X 12 quasi-factorial for 144 varieties we might use a 
12 X 11 for 132 varieties. These modifications, however, require addi- 
tional computations and will be avoided if possible. 
TABLE 61

Values of p and r Required for Different Numbers of Varieties
and Ranges through which the Three General Types of
Incomplete Block Experiments are Recommended

No. of       Symmetrical          Two-Dimensional      Three-Dimensional
Varieties    Incomplete Blocks    Quasi-Factorial      Quasi-Factorial
             p*       r           p        r           p        r
  13          4        4
  16          4        5           4       2n           2       3n
  21          5        5
  25          5        6           5       2n
  27                                                    3       3n
  31          6        6
  36                               6       2n
  49          7        8           7       2n
  57          8        8
  64          8        9           8       2n           4       3n
  73          9        9
  81          9       10†          9       2n
  91         10       10
 100         10       11          10       2n
 111         11       11
 121         11       12          11       2n
 125                                                    5       3n
 133         12       12
 144         12       13          12       2n
 157         13       13
 169         13       14          13       2n
 183         14       14
 196         14       15          14       2n
 211         15       15
 216                                                    6       3n
 225         15       16          15       2n
 etc.                             etc.

* p = number of plots in one block
  r = number of replications

† Completely orthogonalized squares greater than (9 × 9) have not yet been written, and
therefore we cannot if we wished go beyond this point at the present time.
TABLE 62

Yields of Oat Varieties in an Experiment on the Effect
of Soil Inoculation with a Root Rot Organism

Variety    Soil           Replicates
           Treatment      1         2         3         4
I          I              24.1      16.1      31.6      28.9
           U              65.4      49.3      39.8      48.4
II         I              30.6                61.7      42.5
           U              51.8                76.5      56.6
III        I              39.1      47.4      36.9      28.9
           U              68.7      42.0      81.6      57.3
IV         I             120.1      69.6      96.2      69.7
           U             112.2      88.6     102.8      85.0
V          I             118.7      24.1      45.9      10.4
           U              58.5      68.0      77.7      64.7
VI         I              76.2      66.3      77.7      65.3
           U             109.1      91.5     124.1      96.9
VII        I              67.8      45.9      29.7      56.4
           U             112.2      95.9      91.1      77.3
VIII       I              58.0      40.1      47.6      38.4
           U             127.3      66.3      77.0      83.4
IX         I              81.8      23.6      31.6      32.1
           U             100.3      73.8      81.4      52.7
X          I              85.3      78.2      99.4      85.0
           U              81.6      94.3      96.4      77.2
21. Exercises.

1. The results of a randomized block experiment are given in Table 62. Ten
varieties of oats were tested for their reaction to root rot. The plots were arranged
in pairs of which one plot was inoculated with the root-rotting organism and one
plot uninoculated. Analyze the results. State in words the meaning of a significant
interaction between varieties and the soil inoculation.

                            DF        MS
Replicates                   3      2,042.08
Varieties                    9      2,654.19
Error (1)                   27        270.54
Treatments                   1     12,226.51
Varieties × Treatments       9        401.32
Error (2)                   30        232.30


2. In a fertilizer experiment conducted in an 8 × 8 Latin square, the yields of
wheat given in Table 63 were obtained. The fertilizer combinations are designated
N, P, K, NP, NK, PK, NPK, O. In the table the yields are in the exact position of the
plots in the field, and above each yield figure is the fertilizer treatment which the
plot received. Work out the analysis of variance for this experiment, and, by means
of the standard error, compare:

(a) Yields for plots receiving N with those receiving no N.

(b) Yields for plots receiving K with those receiving no K.

(c) Yields for plots receiving P with those receiving no P.

The results for the sums of squares are given below to provide a check on the work,
but the sum of squares for the treatments must be split up to correspond to individual
degrees of freedom.

                 SS       DF
Rows           102.20      7
Columns         84.24      7
Treatments     513.79      7
Error           91.99     42


3. Complete the analysis of the split plot experiment described in Section 8, 
above. Assume that the plan of this experiment is to be rearranged so that the 
most accurate comparison is to be between D and W, and make the plan accordingly. 

The sums of squares for the three errors as given below will provide a com- 
plete check on the calculations. 

Error (1) 647.6 Error (2) 1069.1 Error (3) 931.1 

4. Assuming that the following sets of figures represent the response to fertilizer
at 4 levels, for each set work out the sums of squares for the total and then for the
linear, quadratic, and cubic responses. Graph the actual yield results as given below,
and then point out the relation between the shape of these graphs and the results
obtained for the sums of squares.

          n1     n2     n3     n4
(a)       22     65     54     78
(b)       19     61     58     27
(c)       24     58     13     41

The sums of squares are

                (a)         (b)         (c)
Linear        1232.45       22.05        1.80
Quadratic       90.25     1332.25        9.00
Cubic          396.05       14.45     1155.20
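The figures of this exercise can be checked with a short Python sketch. It assumes, as is usual for 4 equally spaced levels with one yield figure per level, the orthogonal coefficients (−3, −1, 1, 3), (1, −1, −1, 1) and (−1, 3, −3, 1) for the linear, quadratic and cubic responses; the function name is hypothetical. Each sum of squares is the square of the weighted sum of the yields divided by the sum of squares of the coefficients, and the three together make up the total.

```python
def polynomial_sums_of_squares(yields):
    """Linear, quadratic and cubic sums of squares for 4 equally spaced levels."""
    coefficients = {
        "linear": (-3, -1, 1, 3),
        "quadratic": (1, -1, -1, 1),
        "cubic": (-1, 3, -3, 1),
    }
    result = {}
    for name, coefs in coefficients.items():
        contrast = sum(c * y for c, y in zip(coefs, yields))
        divisor = sum(c * c for c in coefs)      # 20, 4 and 20 respectively
        result[name] = contrast ** 2 / divisor
    mean = sum(yields) / len(yields)
    result["total"] = sum((y - mean) ** 2 for y in yields)
    return result


for label, data in (("a", [22, 65, 54, 78]), ("b", [19, 61, 58, 27]), ("c", [24, 58, 13, 41])):
    print(label, polynomial_sums_of_squares(data))
```

Set (a) is dominated by the linear term, set (b) by the quadratic, and set (c) by the cubic, which is the relation the graphs are meant to bring out.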

5. Table 64 gives the plan of a field for a 3 × 3 × 3 confounded experiment,
with treatment numbers and plot yields. The numbers such as 123 and 321 represent
$N_1K_2P_3$ and $N_3K_2P_1$. Cyclic set II was used to confound 2 degrees of freedom
of the triple interaction N × K × P with blocks. Work out the complete analysis of
variance for this experiment, giving the results for treatment effects by individual
degrees of freedom.

The following excerpts from the results for the sums of squares will assist in 

checking the calculations. 

Total for treatments. 2,434.93 

Nr 9 46 

NrXKd 4.73 

Kd X Pr ^8.90 

N X K^X P 149.98 (for one pair of DF) 

Error . . 5.770 81 

6. Table 65 gives the plan of the field with variety numbers and corresponding 
plot yields for a two-dimensional quasi-factorial experiment with two groups of sets. 
Make a complete analysis of the results. 

The variety sum of squares is 263,638. 

7. Table 66 gives the plan of the field with variety numbers and corresponding 
plot yields for an incomplete block experiment with 21 varieties. Analyze the results, 
and make a test of the significance of the mean difference between the varieties 
01 and 04. 

8. Prepare plans for the layout of:

(a) Two-dimensional quasi-factorial experiment to test 36 varieties.

(b) Symmetrical incomplete block experiment to test 31 varieties.

(c) Three-dimensional quasi-factorial experiment to test 125 varieties.



2 ( 


P 

18 8 


N 

12 9 


NK 
10 7 


PK 
18 3 


NP 
17 9 


K 

14 9 


NPK 
19 0 


O 

17.6 


Fertilizer Experiment 


N 

NP 

K 


o 

NPK 

PK 

12 2 

18 3 

15 8 

HI 

11 5 

19 4 

18 9 

NK 

PK 

NPK 

P 

K 

NP 

O 

7 3 

17 4 

17 2 

19 7 

12 0 

19 0 

15 6 

NP 

N 


O 

NPK 

PK 

K 

17 6 

10 4 

wm 

9 8 

16 6 

17 5 

14 3 

K 

NPK 

o 

N 

NP 

P 

NK 

12 6 

14 2 

12 2 

11 4 

14 5 

16 9 

16 1 

1 ® 

NK 

N 

PK 

P 

K 

NPK 

1 12 8 

13 3 

11 3 

16 5 

15 6 

10 9 

16 7 

• 

PK 

0 

NP 

NPK 

N 

NK 

P 

18 2 

12 8 

17 1 

15 8 

9 5 

8 9 

20 6 

P 

A' 

PK 

NP 

NK 

O 

N 

18 9 

11 2 

17 1 

17 9 

8 6 

10 2 

14 5 

NPK 

P 

NK 

K 

PK 

N 

NP 

20 4 

20 8 

16 4 

16 8 

18 5 

13 6 

23 0 
TABLE 64

Plan of Field and Plot Yields for a (3 × 3 × 3) Confounded Experiment


Variety 

Yield 

Variety 

Yield 

Variety 

Yield 

111 

465 

112 

364 

113 


123 

395 

121 

348 

122 


132 

556 

133 

421 

131 

463 

213 

343 

211 

455 

212 

346 

222 

413 

223 

374 

221 

394 

231 

408 

232 

607 

233 

363 

312 

337 

313 

421 

311 

449 

321 

421 

322 

374 

323 

217 

333 

308 

331 

334 

332 

355 

333 

353 

121 

381 

332 

244 

312 

486 

133 

403 

323 

246 

213 

219 

313 

75 

113 

82 

321 

544 

211 

325 

122 

280 

123 

478 

331 

141 

221 

195 

231 

391 

112 

259 

131 

196 

222 

311 

223 

254 

311 

178 

111 

302 

322 

259 

212 

222 

132 

542 

232 

398 

233 

309 

222 

374 

133 

299 

• 

311 

196 

321 

358 

331 

273 

131 

259 

213 

468 

232 

437 

122 

361 

231 

316 

322 

485 

233 

345 

111 

307 

121 

311 

221 

207 

333 

570 

313 

343 

113 

16 

312 

427 

211 

353 

323 

199 

123 

380 

112 

454 

212 

,114 

132 

400 

223 

251 

332 

240 

132 

611 

121 

403 

113 

302 

123 

444 

331 

338 

323 

256 

312 

550 

322 

405 

311 

367 

333 

573 

223 

331 

131 

268 

213 

706 


522 

221 

400 

111 

423 

211 

319 

332 

446 

321 

749 

232 

383 

233 

515 

231 

529 

133 

292 

212 

420 

222 

424 

112 

554 

122 

384 
TABLE 65

Plan of a Field with Variety Numbers and Corresponding Plot Yields for
a Two-Dimensional Quasi-Factorial Experiment with 49 Varieties

TABLE 66

Plan of Field with Variety Numbers and Corresponding Plot Yields for
a Symmetrical Incomplete Block Experiment with 21 Varieties


Vari- 

ety 

Yield 

Vari- 

ety 

Yidd 

Vari- 

ety 

Yield 

Vari- 

ety 

Yield 

Vari- 

ety 

Yield 

Block 

Totals 

13 

465 

11 


14 

556 


343 


413 

2170 

22 

408 

21 


24 

421 


308 


353 

1827 

31 

486 

01 


34 

544 


478 


391 

2118 

01 

311 

44 

302 

43 

542 

42 

374 


358 

1887 

41 

468 

21 

316 

mm 

307 

11 

570 

31 

427 



380 

32 

400 

42 

611 

12 

444 

22 

550 

2386 

43 

673 

13 

706 

23 

423 

02 

749 

33 

529 

2980' 

44 

424 

02 

638 

24 

736 

14 

488 

34 

768 

3044 

22 

364 

11 

348 

44 

421 

33 

455 

03 

374 


12 

607 

34 

421 

43 

374 

21 

334 

03 

381 


24 

403 

42 

75 

13 

325 

31 

141 

03 

259 


32 

254 

23 

259 

03 

398 

14 

299 

41 

?73 

1483 

24 

437 

11 

485 

32 

311 

04 

343 

43 

353 

1929 

14 

454 

33 

251 

21 

403 

42 

338 

04 

405 

1851 

31 

331 

04 

522 

44 

319 

12 

383 

23 

292 

1847 

22 

554 

04 

626 

41 

753 

34 

505 

13 

668 

3106 

42 

549 

34 

348 

05 

463 

H 

346 

23 

394 

2100 

06 

363 

21 

449 

13 

217 

44 

355 

32 

244 

1628 

43 

246 

31 

82 

14 

280 

22 

195 

05 

196 

999 

41 

178 

.06 

222 

24 

309 

12 

196 

33 

259 

1164 

02 

361 

04 

345 

01 

207 

05 

16 

03 

199 

1128 






















CHAPTER XIII


THE ANALYSIS OF VARIANCE APPLIED TO LINEAR 
REGRESSION FORMULAE 

1. Significance of the Regression Function. If, in a series of paired
values, y is the dependent and x is the independent variable, the regression
of y on x is represented by the linear equation $Y = \bar{y} + b(x - \bar{x})$,
where b is the regression coefficient and $Y_1$ is a value of y estimated from
the equation for $x = x_1$. Now if the equation is used to estimate each
value of y from the corresponding values of x, it can be shown that

$$(1 - r^2)\sum(y - \bar{y})^2 = \sum(y - Y)^2$$
$$r^2\sum(y - \bar{y})^2 = \sum(Y - \bar{y})^2 \qquad (1)$$

And since $\sum(y - \bar{y})^2 = (1 - r^2)\sum(y - \bar{y})^2 + r^2\sum(y - \bar{y})^2$, it is obvious
that, if the total sum of squares for the dependent variable is broken up
into two parts, one part $\sum(y - Y)^2$, representing deviations from the
regression function, and another part $\sum(Y - \bar{y})^2$, representing that
portion of the total variability that is accounted for by the regression
function, these two parts are proportional to $(1 - r^2)$ and $r^2$ respectively.
It should be clear that $\sum(y - Y)^2$ represents deviations from the regression
function because for each value of y we are taking the square of the
deviation of that value from the corresponding Y value on the regression
line. Similarly $\sum(Y - \bar{y})^2$ represents the regression function itself
because for each value of y we take the square of the difference between
$\bar{y}$ and the corresponding point on the regression line. As the slope of
the regression line increases, $\sum(Y - \bar{y})^2$ must increase also, and as the
y values approach more closely to the regression line the value of
$\sum(y - Y)^2$ decreases correspondingly.

The direct relation between $\sum(Y - \bar{y})^2$ and the regression equation
may be shown by equating it to

$$\sum(Y - \bar{y})^2 = \sum[\bar{y} + b(x - \bar{x}) - \bar{y}]^2 = b^2\sum(x - \bar{x})^2 \qquad (2)$$

In the expression on the right $\sum(x - \bar{x})^2$ is obviously independent of the
correlation so that any variations in $\sum(Y - \bar{y})^2$ are due entirely to b.
This is an important concept as it shows that, since the value of
$\sum(Y - \bar{y})^2$ for any given distribution of y is dependent on a single
statistic b, it must represent only 1 degree of freedom. Hence the
analysis of variance corresponding to equation (1) will be:

                               Sum of Squares              DF        Mean Square
Regression function            $b^2\sum(x - \bar{x})^2$    1         $b^2\sum(x - \bar{x})^2$
Deviations from regression
function                       $\sum(y - Y)^2$             n' - 2    $\sum(y - Y)^2/(n' - 2)$
Total                          $\sum(y - \bar{y})^2$       n' - 1

where n' is the number of pairs of values of x and y.

In calculating the sum of squares $b^2\sum(x - \bar{x})^2$ it is frequently
convenient to make use of the equality

$$b^2\sum(x - \bar{x})^2 = \frac{[\sum y(x - \bar{x})]^2}{\sum(x - \bar{x})^2} \qquad (3)$$

If b has already been obtained it is of course just as convenient to
multiply $\sum(x - \bar{x})^2$ by $b^2$.

If the correlation coefficient has been determined, a short method
of determining the significance of $r_{xy}$, which is exactly comparable
to determining the significance of $b_{yx}$, arises from the substitution of
$(1 - r^2)\sum(y - \bar{y})^2$ for $\sum(y - Y)^2$, and $r^2\sum(y - \bar{y})^2$ for $b^2\sum(x - \bar{x})^2$, in
the sum of squares column of the analysis of variance. Then F works out
to $r^2(n' - 2)/(1 - r^2)$, and this is all the calculation necessary. In other
words, for a total correlation or a regression coefficient, $F = t^2$, and
tables either of F or of t may be used to test their significance. Refer
here to Chapter VII, equation (11), and note that $F = v_1/v_2$.
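The equalities of this section are easily verified numerically. The following Python sketch is an illustration only: the function name is hypothetical and the six pairs of values are invented for the purpose, not taken from any table in the text. It computes the regression sum of squares by equality (3), the deviations directly from the fitted line, and the F value, which can then be compared with $r^2(n' - 2)/(1 - r^2)$.

```python
def regression_anova(xs, ys):
    """Analysis of variance for the linear regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum(y * (x - x_bar) for x, y in zip(xs, ys))

    b = sxy / sxx                                  # regression coefficient
    ss_regression = sxy ** 2 / sxx                 # equality (3), 1 DF
    # deviations of each y from its estimate Y on the regression line
    ss_deviations = sum((y - (y_bar + b * (x - x_bar))) ** 2 for x, y in zip(xs, ys))
    r_squared = sxy ** 2 / (sxx * syy)
    f_value = ss_regression / (ss_deviations / (n - 2))
    return {
        "b": b,
        "r_squared": r_squared,
        "ss_regression": ss_regression,
        "ss_deviations": ss_deviations,
        "ss_total": syy,
        "F": f_value,
    }


# Invented illustrative data (not from the text)
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
table = regression_anova(xs, ys)
n = len(xs)
print(table["F"], table["r_squared"] * (n - 2) / (1 - table["r_squared"]))
```

The two printed figures agree apart from rounding in the last places, which is the sense in which the test of r and the test of b are the same test.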

2. Test for Non-Linearity. When correlation data are set up in the
form of a correlation table the total sum of squares may be split up into
two portions, one part representing differences between the means of
arrays and the other representing differences between values within
arrays. The equation is

$$\sum(y - \bar{y})^2 = \sum n_p(\bar{y}_p - \bar{y})^2 + \sum\sum(y - \bar{y}_p)^2 \qquad (4)$$

in which the first term on the right is the sum of squares between arrays
and the second the sum of squares within arrays, and where $n_p$ is the number
in an array and $\bar{y}_p$ is the mean of an array. The second summation in the
term on the right means that the sums of squares are first computed for
each array and these are summated. The equation for the corresponding
degrees of freedom is as follows:

$$n' - 1 = (q - 1) + (n' - q) \qquad (5)$$

where q is the number of arrays in the table.

If we picture the sum of squares for between arrays as being due to a
set of means running diagonally across the table following in general the
regression straight line, it is obvious that the sum of squares for between
arrays includes the sum of squares $b^2\sum(x - \bar{x})^2$, worked out above for
deviations due to the regression function, and that the remainder will be
due to deviations of the means of arrays from the regression line. The
equation is

$$\sum n_p(\bar{y}_p - \bar{y})^2 = \sum n_p(\bar{y}_p - Y)^2 + b^2\sum(x - \bar{x})^2 \qquad (6)$$

in which the left-hand side is the sum of squares between arrays, the first
term on the right represents deviations of means of arrays from the
regression line, and the second term is that due to linear regression.

If the means of arrays fall directly on the regression line, $\sum n_p(\bar{y}_p - Y)^2$
will be zero, and correspondingly its value will increase as the trend
of the mean values gets farther away from the trend of the straight
regression line. Then since the sum of squares for within arrays
measures the random variability in the values of y, a comparison of the
estimates of variance obtained from $\sum n_p(\bar{y}_p - Y)^2$ and $\sum\sum(y - \bar{y}_p)^2$
should provide a measure of the linearity of regression, or the goodness
of fit of the regression straight line to the data in question.

The equation for the degrees of freedom corresponding to equation
(6) will be $(q - 1) = (q - 2) + 1$.

The complete analysis of variance may be represented as follows:

                                          Sum of Squares                        DF
Between arrays                            $\sum n_p(\bar{y}_p - \bar{y})^2$     q - 1
    Linear regression                     $b^2\sum(x - \bar{x})^2$              1
    Deviations, means of arrays
    from regression line                  $\sum n_p(\bar{y}_p - Y)^2$           q - 2
Within arrays                             $\sum\sum(y - \bar{y}_p)^2$           n' - q
Total                                     $\sum(y - \bar{y})^2$                 n' - 1

For the purpose of testing linearity, however, it suffices to set up:

                                 Sum of Squares                  DF        Variance
Deviations, means of arrays
from regression line             $\sum n_p(\bar{y}_p - Y)^2$     q - 2     $\sum n_p(\bar{y}_p - Y)^2/(q - 2)$
Within arrays                    $\sum\sum(y - \bar{y}_p)^2$     n' - q    $\sum\sum(y - \bar{y}_p)^2/(n' - q)$
Total                            $\sum(y - Y)^2$                 n' - 2

There are various methods of obtaining the sums of squares for the
above analysis, but one of the most convenient and direct is first to
calculate $\sum n_p(\bar{y}_p - \bar{y})^2$, making use of the identity

$$\sum n_p(\bar{y}_p - \bar{y})^2 = \sum\frac{(\sum_p y)^2}{n_p} - \frac{(\sum y)^2}{n'} \qquad (7)$$

where $\sum_p y$ is the total of the y values in one array. We square the total
of each array and divide by the number in the array. These are summated,
and from the sum we subtract the square of the y total divided by the
number of paired values. Then we calculate $b^2\sum(x - \bar{x})^2$ and,
$\sum(y - \bar{y})^2$ being known, the two sums of squares required can be
obtained by subtraction. The procedure is obvious by reference to the
outline of the analysis of variance above.
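The procedure just outlined can be sketched in Python as follows. The function name, the dictionary layout, and the small set of figures are all hypothetical and serve only to illustrate the order of the calculations: between arrays from the array totals, linear regression from the products, and the two remaining sums of squares by subtraction.

```python
def linearity_test(arrays):
    """Split the sum of squares of y for a correlation table.

    `arrays` maps each x value to the list of y values in that array.
    """
    pairs = [(x, y) for x, ys in arrays.items() for y in ys]
    n = len(pairs)                       # n', the number of paired values
    q = len(arrays)                      # number of arrays
    x_bar = sum(x for x, _ in pairs) / n
    y_total = sum(y for _, y in pairs)
    y_bar = y_total / n

    ss_total = sum((y - y_bar) ** 2 for _, y in pairs)
    # square each array total, divide by its number, sum, subtract correction
    ss_between = sum(sum(ys) ** 2 / len(ys) for ys in arrays.values()) - y_total ** 2 / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum(y * (x - x_bar) for x, y in pairs)
    ss_linear = sxy ** 2 / sxx                 # 1 DF
    ss_deviations = ss_between - ss_linear     # q - 2 DF
    ss_within = ss_total - ss_between          # n' - q DF
    f_value = (ss_deviations / (q - 2)) / (ss_within / (n - q))
    return {
        "between": ss_between,
        "linear": ss_linear,
        "deviations": ss_deviations,
        "within": ss_within,
        "total": ss_total,
        "F": f_value,
    }


# Invented illustrative table: four arrays of three y values each
example = {1: [2, 3, 4], 2: [4, 5, 6], 3: [5, 7, 6], 4: [6, 7, 8]}
print(linearity_test(example))
```

When the means of the arrays lie exactly on a straight line the deviations term is zero and F vanishes, as the text leads one to expect.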

Example 41. Significance of a Regression Function. In Chapter VII, Example 13,
we determined the correlation coefficient for the yields of adjacent barley
plots and in Chapter VI, Example 11, we determined the regression line. Using the
same data and the analysis of variance to test the significance of the regression
function we should get a similar result. The sums of squares are

$$\sum(x - \bar{x})^2 = 3962 - 860^2/200 = 339.60$$
$$b^2\sum(x - \bar{x})^2 = 0.4492^2 \times 339.60 = 68.50$$
$$\sum(y - \bar{y})^2 = 8180 - 1246^2/200 = 417.42$$
$$\sum(y - Y)^2 = 417.42 - 68.50 = 348.92$$

Then the analysis of variance is as follows:

                               Sum of Squares    DF     Variance    F       1% Point
Regression function                 68.50          1     68.50      38.9    6.76
Deviations from regression         348.92        198      1.762
Total                              417.42        199

The F value is well beyond its 1% point, indicating a high degree of significance.
Example 42. The Test for Non-Linearity. We shall again use the data of
Chapter VI, Table 12, for this test. Since we already have $\sum(y - \bar{y})^2$ (Example 41,
above) the first step is to calculate $\sum n_p(\bar{y}_p - \bar{y})^2$. In Chapter VI, Table 13, the
totals for the y arrays are given, so we proceed as follows:

Between arrays               $20^2/4 + 60^2/13 + \cdots + 42^2/6 - 1246^2/200$           =  78.70
Linear regression            $b^2\sum(x - \bar{x})^2 = 0.4492^2 \times 339.60$           =  68.50
Deviations from regression   $\sum n_p(\bar{y}_p - Y)^2$ = Difference                    =  10.20

Total                        $\sum(y - \bar{y})^2$                                       = 417.42
Between arrays               $\sum n_p(\bar{y}_p - \bar{y})^2$                           =  78.70
Within arrays                $\sum\sum(y - \bar{y}_p)^2$ = Difference                    = 338.72

Setting up the analysis of variance, we have:

                               Sum of Squares    DF     Variance    F       5% Point
Deviations, means of arrays
from regression line                10.20          5      2.040     1.16    2.26
Within arrays                      338.72        193      1.755

The F value does not approach its 5% point, so we conclude that there is no evidence
of non-linear regression.
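The last step of Example 42 is a simple ratio of two variances and can be checked as below. This is a minimal sketch; the helper name is hypothetical, and the four figures are the sums of squares and degrees of freedom of the example.

```python
def variance_ratio(ss_numerator, df_numerator, ss_denominator, df_denominator):
    """F as the ratio of two mean squares."""
    return (ss_numerator / df_numerator) / (ss_denominator / df_denominator)


# Deviations of array means from the regression line against within arrays
f_linearity = variance_ratio(10.20, 5, 338.72, 193)
print(round(f_linearity, 2))  # 1.16
```

Since 1.16 is well below the tabulated 5% point of 2.26, the calculation agrees with the conclusion drawn in the example.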

3. Significance of Multiple Correlations. In multiple correlation
where $x_1$ represents the dependent variable and $x_2$ and $x_3$ two independent
variables the regression equation is

$$X_1 = \bar{x}_1 + b_{12}(x_2 - \bar{x}_2) + b_{13}(x_3 - \bar{x}_3) \qquad (8)$$

and this may of course be extended for any number of variates. The
normal equations corresponding to (8) are

$$\sum x_1(x_2 - \bar{x}_2) = b_{12}\sum(x_2 - \bar{x}_2)^2 + b_{13}\sum x_2(x_3 - \bar{x}_3)$$
$$\sum x_1(x_3 - \bar{x}_3) = b_{12}\sum x_2(x_3 - \bar{x}_3) + b_{13}\sum(x_3 - \bar{x}_3)^2 \qquad (9)$$

and from these we can derive the solution

$$\sum(x_1 - \bar{x}_1)^2 = \sum(x_1 - X_1)^2 + b_{12}\sum x_1(x_2 - \bar{x}_2) + b_{13}\sum x_1(x_3 - \bar{x}_3) \qquad (10)$$

This equation corresponds to (1) above where the first term on the right
represents the portion of the sum of squares for $x_1$ that is independent
of $x_2$ and $x_3$. The other two terms on the right represent the portion of
the sum of squares for $x_1$ that is dependent on $x_2$ and $x_3$. These terms
may of course be written $b_{12}^2\sum(x_2 - \bar{x}_2)^2$ and $b_{13}^2\sum(x_3 - \bar{x}_3)^2$, in which
form they correspond to $b^2\sum(x - \bar{x})^2$ as above in equation (2). Equation
(10) may also be written

$$\sum(x_1 - \bar{x}_1)^2 = (1 - R^2)\sum(x_1 - \bar{x}_1)^2 + R^2\sum(x_1 - \bar{x}_1)^2 \qquad (11)$$

where R is the multiple correlation coefficient. Also

$$(1 - R^2)\sum(x_1 - \bar{x}_1)^2 = \sum(x_1 - X_1)^2$$

and

$$R^2\sum(x_1 - \bar{x}_1)^2 = b_{12}\sum x_1(x_2 - \bar{x}_2) + b_{13}\sum x_1(x_3 - \bar{x}_3).$$


It follows from (10) and (11) that a multiple regression can be
expressed as an analysis of variance as follows:

                        Sum of Squares                          DF            Variance
Regression function     $R^2\sum(x_1 - \bar{x}_1)^2$            p             $R^2\sum(x_1 - \bar{x}_1)^2/p$
Deviations from
regression function     $(1 - R^2)\sum(x_1 - \bar{x}_1)^2$      n' - p - 1    $(1 - R^2)\sum(x_1 - \bar{x}_1)^2/(n' - p - 1)$
Total                   $\sum(x_1 - \bar{x}_1)^2$               n' - 1

where p is the number of independent variables. To test the significance
of a multiple correlation therefore it is only necessary to find

$$F = \frac{R^2(n' - p - 1)}{(1 - R^2)\,p}$$

and look up the 5% point of F corresponding to $n_1 = p$ and
$n_2 = n' - p - 1$.

Example 43. The Significance of a Multiple Correlation. Let $R_{1.2345} = 0.6457$,
and suppose it has been obtained from a series of 84 values of $x_1$, $x_2$, $x_3$, $x_4$,
and $x_5$. We have

$$F = \frac{0.6457^2 \times 79}{(1 - 0.6457^2) \times 4} = 14.12$$

For p = 4 and n' − p − 1 = 79, the 1% point of F is 3.56, so that the multiple
correlation is highly significant.
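The test of Example 43 requires nothing beyond R, the number of observations, and the number of independent variables, as the following Python sketch shows. The function name is hypothetical; the formula is the ratio of the two variances in the analysis of variance for multiple regression given above.

```python
def multiple_correlation_f(r_multiple, n_obs, p):
    """F for a multiple correlation R from n_obs observations and p independent variables."""
    r_squared = r_multiple ** 2
    regression_variance = r_squared / p                    # p DF
    residual_variance = (1 - r_squared) / (n_obs - p - 1)  # n' - p - 1 DF
    return regression_variance / residual_variance


f_value = multiple_correlation_f(0.6457, n_obs=84, p=4)
print(round(f_value, 2))  # 14.12
```

The common factor $\sum(x_1 - \bar{x}_1)^2$ cancels from numerator and denominator, which is why it does not appear in the function.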

4. Special Applications. The analysis of variance can be used to
determine the significance of the additional information obtained in
calculating multiple correlation coefficients. This method was used by
Geddes and Goulden (2) in a practical problem in cereal chemistry.
Correlations were first determined between loaf volume of wheat flour
and the percentage of protein. In later studies the protein was separated
into two portions, peptized and non-peptized, and using these two
portions as variables the multiple correlation for their combined effect on
loaf volume was calculated. If the proportions of the two kinds of protein
have an important effect on loaf volume, the multiple correlation
should be significantly higher than the simple correlation for total protein
and loaf volume. A method of comparing the two correlations
would determine therefore the practical significance, for purposes of
predicting flour quality, of knowing the amounts of peptized and
non-peptized protein in addition to the total protein.

If we let $x_1$ represent loaf volume, $x_2$ the peptized protein, $x_3$ the
non-peptized protein, and $x_p$ the total protein, the corresponding simple
and multiple correlation coefficients are $r_{1p}$ and $R_{1.23}$. The total protein
is of course $(x_2 + x_3)$, the sum of the two fractions.

Assuming these correlations to be determined from 20 pairs of values,
the sums of squares representing deviations from the regression function
are proportional to $(1 - r_{1p}^2)$ and $(1 - R_{1.23}^2)$, respectively, and the
corresponding degrees of freedom are 18 and 17. The effect of using
more variables to estimate $x_1$, as in the case of multiple regression, is to
decrease the sum of squares due to deviations from the regression function,
but for each additional variable introduced 1 degree of freedom is
lost, and unless the reduction of the sum of squares is more than
proportional to the loss in degrees of freedom there is no gain in precision. An
analysis may therefore be set up as follows:


                                                 Sum of Squares                        DF    Variance
Deviations from regression of
$x_1$ on $x_p$                                   $1 - r_{1p}^2$                        18
Deviations from regression of
$x_1$ on $x_2$ and $x_3$                         $1 - R_{1.23}^2$                      17    (1)
Additional degree of freedom                     $(1 - r_{1p}^2) - (1 - R_{1.23}^2)$   1     (2)

Applying the z test to the mean squares (1) and (2), using (1) as an error,
we can determine the significance of the gain in information due to
the addition of another variable.

In one actual experiment for a series of 20 flours from No. 2 Northern
wheat $r_{1p} = 0.511$ and $R_{1.23} = 0.732$. The analysis gives:


                     Sum of Squares    DF    Variance      F      1% Point

$1 - r_{1p}^2$          0.738879       18
$1 - R_{1.23}^2$        0.464176       17    0.027304
Difference              0.274703        1    0.274703    10.06      8.40

In this case there was a decided gain in information owing to the
separation of the protein into two components.

In the general case to which this method may be applied note that
$(1 - r^2)$ represents $(n' - 2)$ degrees of freedom and $(1 - R^2)$,
$(n' - p - 1)$ degrees of freedom. The difference between the two
sums of squares will be represented therefore by
$(n' - 2) - (n' - p - 1) = (p - 1)$ degrees of freedom.
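The general rule above can be written out as a short calculation. The sketch below is only an illustration in Python; the function name and argument names are assumptions made for this example and do not come from the text. It forms the variance ratio for the additional degrees of freedom against the error from the multiple regression, and applies it to the flour example ($r_{1p} = 0.511$, $R_{1.23} = 0.732$, $n' = 20$, $p = 2$).

```python
def gain_from_multiple_correlation(r_simple, r_multiple, n_pairs, n_predictors):
    """Variance ratio (F) for the gain of a multiple over a simple correlation.

    r_simple     : simple correlation with a single independent variable
    r_multiple   : multiple correlation with n_predictors independent variables
    n_pairs      : number of observations, n'
    n_predictors : p, the number of independent variables behind r_multiple
    """
    ss_simple = 1.0 - r_simple ** 2        # proportional to residual SS, n' - 2 DF
    ss_multiple = 1.0 - r_multiple ** 2    # proportional to residual SS, n' - p - 1 DF
    df_gain = n_predictors - 1             # (n' - 2) - (n' - p - 1)
    df_error = n_pairs - n_predictors - 1
    ms_gain = (ss_simple - ss_multiple) / df_gain
    ms_error = ss_multiple / df_error
    return ms_gain / ms_error, df_gain, df_error


# Flour example from the text.
f_ratio, df_gain, df_error = gain_from_multiple_correlation(0.511, 0.732, 20, 2)
print(round(f_ratio, 2), df_gain, df_error)
```

The ratio is then compared with the tabulated 5% or 1% point of F for the two degrees of freedom returned, exactly as in the table above.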

8. Exercises.

1. For the data in Chapter VI, Table 15, determine the significance of the
regression function by means of the analysis of variance, where the flour carotene is
taken as the dependent variable. F = 159.5.

2. For the same data as in Exercise 1 above, test for linearity of regression.
F = 3.21.

3. Apply the test for non-linearity to the data in Table 67 for the relation between
loaf volume according to a standard baking formula and the percentage protein of 
wheat flour. If there is evidence of non-linearity calculate the regression equation 
and make a graph showing the regression line and the means of the arrays. 

• 

TABLE 67

CORRELATION SURFACE FOR RELATION BETWEEN PROTEIN AND LOAF VOLUME

                              Protein in Percentage
Loaf volume
in cc.     11.0  11.5  12.0  12.5  13.0  13.5  14.0  14.5  15.0  15.5   Total

  950                                                   1                  1
  900                                             2     5                  7
  850                                 5     2     5     6     3     1     22
  800                           6    15     7     3    12     2           45
  750                     9    12    14     5     6     2     1     1     50
  700               1     7     3     1     1                             13
  650               4     5     2                                         11
  600         4     6     2                                               12
  550               2                                                      2
  500         1                                                            1

Total         5    13    23    23    35    15    16    26     6     2    164

4. For $n' = 40$, determine the multiple correlation $R_{1.234}$ that is just significant.

5. Determine the significance of the gain in information through the calculation
of multiple correlations in the examples given below. For each comparison, state
your conclusion in words.

    $n' = 40$    $r_{12} = 0.7643$    $R_{1.234} = 0.8031$
    $n' = 62$    $r_{12} = 0.8744$    $R_{1.2345} = 0.9664$
    $n' = 20$    $r_{12} = 0.7621$    $R_{1.23} = 0.7635$
    $n' = 20$    $r_{12} = 0.7316$    $R_{1.23456} = 0.7329$



CHAPTER XIV 


NON-LINEAR REGRESSION 


1. An Example of Non-Linear Regression. In Chapter XIII, 
Section 5, Exercise 3, a test for non-linearity was applied to a correlation
surface for the relation between protein and loaf volume of wheat flour
in a baking experiment. The non-linearity is significant, and on plotting 
the means of the arrays we find that with increasing protein there is at 
first a very rapid increase in the loaf volume, but with higher protein 
flours the increase in loaf volume is slower and finally there are indica- 
tions that the loaf volume is actually decreasing. Here we have a 
typical example of non-linearity, and it is obvious that, in such cases, 
methods for the prediction of values of the dependent variable from 
specific values of the independent variable cannot be based on a straight- 
line equation. 

2. The Correlation Ratio. In cases of non-linear regression the
correlation ratio (1) is sometimes used to represent the relation between
the two variables. The correlation ratio is defined by

$$\eta_{yx} = \sqrt{\frac{\Sigma(\bar{y}_x - \bar{y})^2}{\Sigma(y - \bar{y})^2}} \qquad (1)$$

where $\bar{y}_x$ is the mean of the array in which each observation falls,
and its relation to the correlation coefficient will be obvious from the
outline of the analysis of variance of Chapter XIII, Section 2. The
correlation coefficient may be defined as follows if we take into account its
numerical value only:

$$r = \sqrt{\frac{\Sigma(Y - \bar{y})^2}{\Sigma(y - \bar{y})^2}} \qquad (2)$$

and it is clear that in the correlation ratio the numerator contains the
sum of squares $\Sigma(Y - \bar{y})^2$ plus the sum of squares due to deviations of
means of arrays from the regression line. Hence $\eta$ is always greater than
$r$ unless the means of the arrays fall exactly on the regression line. The
correlation ratio measures the total variability of the means of arrays,
and this may be due in part either to a linear relation between the
variables or to some other type of relation. It does not, however, represent
a relation that can be expressed by a mathematical equation, either
linear or curvilinear. The correlation ratio is therefore not a very
satisfactory statistic as it cannot be used to predict one variable from another.
Its use must be confined to a measurement of the significance of the
total variability of the means of the arrays and in this respect must be
interpreted in terms of the analysis of variance. Thus in Chapter
XIII, Section 2, the analysis of variance test will involve a comparison
of the variance between arrays with the variance within arrays.
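The relation between the correlation ratio and the correlation coefficient can be checked numerically on the coded correlation surface of Table 67. The sketch below is an illustration only: the variable names are assumptions, and the array frequencies and array totals are those tabulated for the coded classes 1 to 10 in Table 68 of the next chapter. It computes the sum of squares between arrays, the sum of squares due to linear regression, and from them $\eta$ and $r$.

```python
import math

# Coded data from the correlation surface: for each protein class x = 1..10,
# n_x is the number of flours and t_yx the total of their coded loaf volumes.
x = list(range(1, 11))
n_x = [5, 13, 23, 23, 35, 15, 16, 26, 6, 2]
t_yx = [13, 43, 115, 137, 234, 100, 115, 199, 44, 14]
# Frequencies of the coded loaf-volume classes y = 1..10.
n_y = [1, 2, 12, 11, 13, 50, 45, 22, 7, 1]

n = sum(n_x)
sum_y = sum(t_yx)
sum_y2 = sum(f * y * y for y, f in enumerate(n_y, start=1))
correction = sum_y ** 2 / n

total_ss = sum_y2 - correction                                   # S(y - ybar)^2
between_ss = sum(t * t / m for t, m in zip(t_yx, n_x)) - correction

sum_x = sum(m * xi for m, xi in zip(n_x, x))
sum_x2 = sum(m * xi * xi for m, xi in zip(n_x, x))
sum_xy = sum(xi * t for xi, t in zip(x, t_yx))
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
regression_ss = sxy ** 2 / sxx                                   # S(Y - ybar)^2

eta = math.sqrt(between_ss / total_ss)
r = math.sqrt(regression_ss / total_ss)
print(round(eta, 4), round(r, 4))
```

The excess of `between_ss` over `regression_ss` is the sum of squares for deviations of the array means from the regression line, which is why $\eta$ cannot fall below $r$.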

The popularity of the correlation ratio was occasioned partly by
the use of Blakeman's criterion $(\eta^2 - r^2)$ as a test for linearity (1).
R. A. Fisher (3) has shown that this test is not satisfactory and that the
analysis of variance can be used as described in Chapter XIII to provide
an accurate test. The correlation ratio as such is therefore not much
used at the present time. It may frequently be necessary to apply a
test of significance to the variance for the means of arrays in a correlation
surface, but this does not necessitate the actual calculation of the
correlation ratio. Elaborate methods have been developed for testing the
significance of the correlation ratio, but these are now unnecessary as
the problem has been completely solved by Fisher's z distribution and
the analysis of variance. The test, as we have noted in the previous
chapter, is now quite simple.

3. Types of Regression Equations. The procedure in making a 
critical study of the relation between two variables when this relation is 
non-linear is to endeavor to find some type of mathematical equation 
that will give a good fit. This is obviously not always a simple problem
as there are a number of types of equations to choose from and in each
case the method of making an accurate test of the goodness of fit must
be considered. The first step is to examine the trend of the values in the
regression graph and from its general characteristics decide as to the
type of equation to be used. After the type has been selected the
actual equation must be determined by direct methods.

The simple straight-line equation that we have dealt with previously
is

$$Y = \bar{y} + b_{yx}(x - \bar{x}) = \bar{y} - b_{yx}\bar{x} + b_{yx}x$$

and since $\bar{y} - b_{yx}\bar{x}$ is a constant we can write this equation in the form

$$Y = c_0 + c_1 x$$

where $c_0 = \bar{y} - b_{yx}\bar{x}$ and $c_1 = b_{yx}$, the regression coefficient. This is a
convenient form with which to represent the various kinds of regression
equations, which in general are of two types: (1) polynomials, and (2)
logarithmic. Typical examples are as follows:

POLYNOMIALS                                   LOGARITHMIC

$Y = c_0 + c_1 x$                             $Y = c_0 + c_1 \log x$
$Y = c_0 + c_1 x + c_2 x^2$                   $\log Y = c_0 + c_1 x$
$Y = c_0 + c_1 x + c_2 x^2 + c_3 x^3$         $\log Y = c_0 + c_1 \log x$
etc.                                          etc.

Of the polynomials the first is the simple straight-line equation, the 
second is the simple parabola or quadratic, and the third is the cubic. 
The simple parabola has only one maximum or minimum point, and 
there are no points of inflection. The cubic has both a maximum and a 
minimum point and one point of inflection. Curves of higher degree 
have more maximum and minimum points and tend to twist oftener and 
more rapidly. A most interesting characteristic of the polynomial 
equations is one that has already been noted in Chapter XII, in dealing 
with the separation of sums of squares corresponding to individual 
degrees of freedom. The effects represented by the polynomials of 
different degree are independent, and we refer to them as the orthogonal 
polynomials. This property is of particular value in curve fitting as it 
simplifies materially the problem of testing the goodness of fit at each 
stage of fitting. 

Logarithmic curves may be regarded as modifications of the other 
types. Thus the straight-line equation F = co + cix may be changed 
to a logarithmic equation by replacing x by log x. The result of this 
change is a crowding together of the x ordinates farthest away from zero.
A straight line with a positive slope is changed therefore to a curved line
which has a very decided slope at the origin but changes rapidly as x
increases and reaches a point finally where the slope is fairly constant 
but much less than that of the original straight line. Logarithmic 
curves, in addition, cannot be used to represent negative values, and in 
this respect are therefore much more limited in their application than 
the polynomials. 

The characteristics of the different types of equations are most easily 
learned by working out the Y values for some imaginary equations and 
plotting the curves on graph paper. 
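The suggestion above, of working out Y values for imaginary equations, can also be carried out with a few lines of code. The sketch below is only an illustration: the coefficients are hypothetical values chosen to give curves of convenient size, and logarithms are taken to base 10. It tabulates Y for each of the six types listed earlier over a range of positive x values, ready for plotting.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
poly = (2.0, 1.5, -0.2, 0.01)     # c0, c1, c2, c3 for the polynomial types
loga = (1.0, -0.3)                # c0, c1 for the logarithmic types


def polynomial(x, degree):
    """Y = c0 + c1*x + ... + c_degree * x**degree."""
    return sum(poly[k] * x ** k for k in range(degree + 1))


curves = {
    "Y = c0 + c1 x": lambda x: polynomial(x, 1),
    "Y = c0 + c1 x + c2 x^2": lambda x: polynomial(x, 2),
    "Y = c0 + c1 x + c2 x^2 + c3 x^3": lambda x: polynomial(x, 3),
    "Y = c0 + c1 log x": lambda x: loga[0] + loga[1] * math.log10(x),
    "log Y = c0 + c1 x": lambda x: 10 ** (loga[0] + loga[1] * x),
    "log Y = c0 + c1 log x": lambda x: 10 ** (loga[0] + loga[1] * math.log10(x)),
}

# x must be positive wherever log x is used.
xs = [0.5 * k for k in range(1, 11)]          # 0.5, 1.0, ..., 5.0
table = {name: [f(x) for x in xs] for name, f in curves.items()}

for name, ys in table.items():
    print(f"{name:32s}", " ".join(f"{y:7.3f}" for y in ys))
```

Changing the signs and sizes of the coefficients and re-plotting shows quickly how each type bends, which is the purpose of the exercise.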

4. A General Method of Fitting Polynomials. With the data such 
as those of Table 67, Chapter XIII, before us in the form of a correla- 
tion surface, we may inquire as to the possibility of expressing the rela- 
tion between protein and loaf volume by some simple mathematical 
equation, the end result of our inquiry being to obtain the best method 
available for predicting the loaf volume that will be obtained from the 
flours of a given protein content. The selection of the best type of 
equation is fairly easy in this case. First we prepare a graph of the 
means of the y arrays as in Fig. 12, connecting the points with a dotted
line. The general trend of the points seems to follow fairly closely
the first half of the second-degree parabola, or of the portion of a
third-degree curve up to the maximum point. There is very little resemblance
to a logarithmic curve as the first portion of it is nearly straight and with
a greater curvature towards the end. Of course polynomials of higher
degree may give a better fit than those of the second degree or third
degree, and the problem resolves itself therefore into the selection of a
polynomial that will give the greatest degree of precision in predicting y
from particular values of x.

Fig. 12. Graph of means of y arrays; data of Table 67.

SELECTION OF EQUATION GIVING THE BEST FIT

The problem of selecting an equation of the degree that gives the 
greatest precision for prediction purposes is of paramount importance 
in curve fitting and one which may easily be overlooked in a maze of 
technical details leading to the fitting of curves of a high order. Unless 
we can be sure that a curve fits better than a straight line it would be
better not to use the curve. In certain cases the improvement in fit due
to one equation over another is clearly visible by inspection, but this is
certainly not generally true. For example, in comparing second and
third-degree curves, the latter often appear to fit better than the former,
but a critical test may show that the situation is definitely otherwise.
In the methods of curve fitting described below, particular attention
is given to the problem of determining goodness of fit. We begin by 
fitting a straight line or a curve of low degree and follow up with addi- 
tional stages of fitting. At each stage one degree of freedom is utilized 
in fitting, and the variance represented by this degree of freedom is tested 
against the error of regression. As a general rule, when a curve has 
been obtained that passes reasonably well through the points, and if in 
making use of an additional degree of freedom there is no gain in preci- 
sion, the curve of lower degree fitted previously is taken as giving the 
best fit. 


METHOD 

The fitting of polynomials is an application of the method of least
squares. Where Y represents the values of y estimated from the
regression equation for given values of x, the type regression equation is as
follows:

$$Y = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m \qquad (3)$$

and consequently the error of estimation is given by

$$\Sigma(y - Y)^2 = \Sigma(y - c_0 - c_1 x - c_2 x^2 - \cdots - c_m x^m)^2. \qquad (4)$$

The best values for substitution in the equation for $c_0, c_1, c_2, \ldots, c_m$ are
taken as those that give a minimum value to $\Sigma(y - Y)^2$. Minimizing
the expression on the right in (4) we obtain a set of $m + 1$ simultaneous
equations, where $m + 1$ is the number of unknowns and $m$ is the highest
power of x in the polynomial equation to be derived. These simultaneous
equations are known as the normal equations, owing to the
symmetrical nature of the coefficients. For the general case they are as
follows, where x and y are measured from their means:

$$n'c_0 + \Sigma(x)c_1 + \Sigma(x^2)c_2 + \cdots + \Sigma(x^m)c_m = \Sigma(y)$$
$$\Sigma(x)c_0 + \Sigma(x^2)c_1 + \Sigma(x^3)c_2 + \cdots + \Sigma(x^{m+1})c_m = \Sigma(xy)$$
$$\Sigma(x^2)c_0 + \Sigma(x^3)c_1 + \Sigma(x^4)c_2 + \cdots + \Sigma(x^{m+2})c_m = \Sigma(x^2 y)$$
$$\cdots$$
$$\Sigma(x^m)c_0 + \Sigma(x^{m+1})c_1 + \Sigma(x^{m+2})c_2 + \cdots + \Sigma(x^{2m})c_m = \Sigma(x^m y)$$

The symmetrical nature of the coefficients allows for a method of
solution commonly known as the Doolittle method wherein the total
amount of calculation involved is very considerably reduced as
compared with the ordinary method of solving a set of simultaneous
equations. After $c_0, c_1, c_2, \ldots, c_m$ have been solved for, the setting up of the
regression equation is merely a matter of substituting the values of these
statistics in equation (3).
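The normal equations can be set up and solved mechanically once the sums of powers and products are available. The following sketch is an illustration only and is not the Doolittle tabular scheme of the text: it uses ordinary Gauss-Jordan elimination in exact rational arithmetic, and the function name and the small data set are hypothetical. The data are generated from $y = 1 + 2x + 3x^2$, so the fitted coefficients should reproduce 1, 2 and 3.

```python
from fractions import Fraction


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial from the normal equations; returns [c0, ..., cm]."""
    m = degree
    # S(x^k) for k = 0 .. 2m, and S(x^k y) for k = 0 .. m.
    power_sums = [sum(Fraction(x) ** k for x in xs) for k in range(2 * m + 1)]
    moment_sums = [sum(Fraction(x) ** k * Fraction(y) for x, y in zip(xs, ys))
                   for k in range(m + 1)]
    # Row i of the augmented matrix is
    #   S(x^i) c0 + S(x^(i+1)) c1 + ... + S(x^(i+m)) cm = S(x^i y).
    rows = [[power_sums[i + j] for j in range(m + 1)] + [moment_sums[i]]
            for i in range(m + 1)]
    for col in range(m + 1):
        pivot = next(r for r in range(col, m + 1) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(m + 1):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [float(row[-1]) for row in rows]


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
xs = [1, 2, 3, 4, 5, 6]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coefficients = fit_polynomial(xs, ys, 2)
print(coefficients)
```

Exact fractions are used here because the sums of high powers of x grow quickly; the tabular method of the text reaches the same solution with far less arithmetic when the work is done by hand.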

TESTING THE GOODNESS OF FIT 

The method of testing the significance of the variance corresponding 
to each degree of freedom used in fitting is merely an extension of the 
method described in Chapter XIII for testing the significance of a 
straight-line regression function. 

Let $R_0 = \Sigma(y - \bar{y})^2$, $R_1 = \Sigma(y - Y_1)^2$, and $\Sigma(Y_1 - \bar{y})^2$ is the
sum of squares due to the regression function for one degree of fitting.
The analysis is of the form:

                              SS                                 DF

Regression function           $\Sigma(Y_1 - \bar{y})^2$                 1
First residual                $R_1 = \Sigma(y - Y_1)^2$           $n' - 2$
Total                         $R_0 = \Sigma(y - \bar{y})^2$       $n' - 1$

If a second statistic is fitted the residual $R_1$ will be reduced by an amount
equal to the difference between the sums of squares for the two
regression functions, i.e., by $\Sigma(Y_2 - \bar{y})^2 - \Sigma(Y_1 - \bar{y})^2$, which for
convenience we will put equal to $\Sigma(Y_1 - Y_2)^2$. The new residual may be
represented by $R_2$, and the analysis will be:

                                      SS                         DF

Difference, regression functions      $\Sigma(Y_1 - Y_2)^2$             1
Residual                              $R_2$                      $n' - 3$
First residual                        $R_1$                      $n' - 2$

Obviously this process can be continued indefinitely, providing at each 
stage a test of the significance of the additional statistic fitted in the 
regression equation. Isserlis has shown how the sums of squares for
each regression coefficient can be obtained simultaneously with the
solution of the equations for the unknowns. His method involves solving
for the regression coefficients $c_0, c_1, \ldots, c_m$ by means of algebraical
formulae, and since this method appears to be somewhat laborious, the
work in the following examples is performed in tables by a technique
similar to that used in solving the equations for partial regression and
correlation coefficients. It is shown also how the sums of squares 
required for the tests of significance may be obtained directly from 
these tables. 

The analysis of variance test as used here should not be confused 
with the test for non-linearity as described in Chapter XIII. The 
regression straight line may not be a good fit, but, if it is a better fit than 
the horizontal line representing the mean of y, the test we use here will 
show it to be significant. At the same time, the test for non-linearity 
will indicate significant deviations of the means of the y arrays from the 
regression line. As a matter of fact, after fitting a straight line it is 
desirable to apply the test for linearity. If there is no evidence of 
non-linearity there is no object in proceeding to the fitting of a curve of 
higher degree. 

Example 44. For this example we shall use the data of Table 67 and fit
polynomials by successive stages up to the third degree.

The first step in the procedure of fitting regression lines is to obtain the values of
the coefficients for the normal equations. These are best obtained as in Table 68,
which is divided into sections, each section representing the data necessary for
calculating one additional constant. Thus Section A is necessary for fitting a straight
line; if we wish to fit a second-degree curve we proceed with Section B, and so forth.
This is continued until it is obvious that further fitting is unnecessary. In actual
practice we will probably not have to go beyond fitting to the third degree.

Note that the actual classes for both y and x are replaced by 1, 2, 3, ..., 10. This
reduces the labor a great deal, and, when the Y values have finally been calculated
for drawing the curve, they may be converted to actual values by the method
described in Chapter II, Section 8, for converting means; or the whole equation may be
converted to actual values by methods similar to those described in Chapter VI,
Section 6.

The easiest method for calculating the sum of the powers of x is by continuous
multiplication. First, $N_x x$ is calculated for each array, and to obtain the figures in
$N_x x^2$ we simply multiply each of the $N_x x$ values by x. When we reach the last
column of one section it is good practice to check this column using a table of powers
of x. This checks all the previous calculations of the powers of x.

Having carried out the calculations as in Table 68, Section A, we write the normal 
equations for fitting a straight line. For the general case these are 

$$n'c_0 + \Sigma(x)c_1 = \Sigma(y)$$
$$\Sigma(x)c_0 + \Sigma(x^2)c_1 = \Sigma(xy) \qquad (5)$$

and substituting the actual coefficients we have

$$164c_0 + 851c_1 = 1014$$
$$851c_0 + 5181c_1 = 5695 \qquad (6)$$

TABLE 68

CALCULATION OF COEFFICIENTS FOR FITTING A POLYNOMIAL UP TO THE THIRD DEGREE

N_y = frequency of y; N_x = frequency of x; T_yx = total of y for each x array;
ybar_x = mean of y for each x array.

Section A

   y    N_y      x    N_x    T_yx    ybar_x    N_x x   N_x x^2    x T_yx   T_yx^2/N_x

   1      1      1      5      13    2.6000        5         5        13      33.8000
   2      2      2     13      43    3.3077       26        52        86     142.2308
   3     12      3     23     115    5.0000       69       207       345     575.0000
   4     11      4     23     137    5.9565       92       368       548     816.0435
   5     13      5     35     234    6.6857      175       875      1170    1564.4571
   6     50      6     15     100    6.6667       90       540       600     666.6667
   7     45      7     16     115    7.1875      112       784       805     826.5625
   8     22      8     26     199    7.6538      208      1664      1592    1523.1154
   9      7      9      6      44    7.3333       54       486       396     322.6667
  10      1     10      2      14    7.0000       20       200       140      98.0000

Total   164            164    1014               851      5181      5695    6568.5427

                 Section B                              Section C

   x    N_x x^3    N_x x^4    x^2 T_yx      N_x x^5       N_x x^6    x^3 T_yx

   1          5          5          13            5             5          13
   2        104        208         172          416           832         344
   3        621      1,863       1,035        5,589        16,767       3,105
   4      1,472      5,888       2,192       23,552        94,208       8,768
   5      4,375     21,875       5,850      109,375       546,875      29,250
   6      3,240     19,440       3,600      116,640       699,840      21,600
   7      5,488     38,416       5,635      268,912     1,882,384      39,445
   8     13,312    106,496      12,736      851,968     6,815,744     101,888
   9      4,374     39,366       3,564      354,294     3,188,646      32,076
  10      2,000     20,000       1,400      200,000     2,000,000      14,000

Total    34,991    253,557      36,197    1,930,751    15,245,301     250,489

SUMMARY OF COEFFICIENTS

Section A                             Section B                       Section C

$n' = 164$                            $\Sigma(x^3) = 34{,}991$        $\Sigma(x^5) = 1{,}930{,}751$
$\Sigma(x) = 851$                     $\Sigma(x^4) = 253{,}557$       $\Sigma(x^6) = 15{,}245{,}301$
$\Sigma(y) = 1{,}014$                 $\Sigma(x^2 y) = 36{,}197$      $\Sigma(x^3 y) = 250{,}489$
$\Sigma(x^2) = 5{,}181$
$\Sigma(xy) = 5{,}695$
$\Sigma(T_{yx}^2/N_x) = 6{,}568.54$
$\Sigma(y - \bar{y})^2 = 428.512$

The solution of these equations is carried out as in Table 69, the method being
identical with that described in Chapter VIII for partial regression and correlation
coefficients. Note the "check sum" column, which is used for checking the
calculations as you proceed, and in addition the "check line" just below the "reverse," that
gives a complete check on all the calculations including those in the reverse. In
Table 69 the check line is obtained as follows:

$$164 \times 3.244{,}175 + 851 \times 0.566{,}340 = 1014$$

It is merely a substitution of the statistics $c_0$ and $c_1$ in the first equation of (6).

At the foot of Table 69 we have the analysis of variance for testing the significance
of the degree of freedom due to the regression straight line. $R_0 = \Sigma(y - \bar{y})^2$ is
obtained from Table 68, using the equality

$$\Sigma(y - \bar{y})^2 = \Sigma(y^2) - \frac{(\Sigma y)^2}{n'}$$

$\Sigma(Y_1 - \bar{y})^2$ is then obtained from the solution of the normal equations by multiplying
the figure in line 5, column 1 (5,1), by the square of the figure in line 6, column
K, $(6,K)^2$. The difference is the sum of squares $\Sigma(y - Y_1)^2 = R_1$, and may be taken
to represent the error of regression and is therefore appropriate for testing the
significance of the variance due to the regression line. In the example, we find that
the regression is decidedly significant but we proceed to the second stage in order to
determine whether or not greater accuracy can be obtained.

Proceeding to the fitting of a polynomial of the form $Y = c_0 + c_1 x + c_2 x^2$, we
write the normal equations

$$n'c_0 + \Sigma(x)c_1 + \Sigma(x^2)c_2 = \Sigma(y)$$
$$\Sigma(x)c_0 + \Sigma(x^2)c_1 + \Sigma(x^3)c_2 = \Sigma(xy) \qquad (7)$$
$$\Sigma(x^2)c_0 + \Sigma(x^3)c_1 + \Sigma(x^4)c_2 = \Sigma(x^2 y)$$

and the necessary data for solving the equations are obtained as in Section B of Table
68. The solution of the equations is performed according to Table 70, and note that in
this table columns (0) and (1) can be copied directly from Table 69, and column K
can be copied as far as line 6. The reverse and the check line are calculated in the
usual way. For the analysis of variance $R_1$ is brought forward from Table 69, and
$\Sigma(Y_1 - Y_2)^2$ is calculated by multiplying (10,2) by $(11,K)^2$, where the numbers in
the brackets correspond to line and column respectively. The difference between
the two sums of squares is $R_2$, which can now be taken to represent the error of
regression. In the example we find that the variance due to the additional degree
of freedom used in calculating the second-degree curve is quite significant, so we can
conclude that a real gain in precision has been made.

TABLE 69

SOLUTION OF NORMAL EQUATIONS FOR FITTING A STRAIGHT LINE

Line           0             1              K              Sum

 1           164           851           1014             2029
 2                    -5.189,024     -6.182,927      -12.371,95
 3                          5181           5695           11,727
 4                    -4415.8594     -5261.6703      -10,528.530
 5                      765.1406       433.3297        1,198.470
 6                                   -0.566,340       -1.566,340

Reverse
$c_1 = +0.566{,}340$
$c_0 = +6.182{,}927 - 2.938{,}752 = +3.244{,}175$

Check line:  532.0447 + 481.9553 = 1014

                                S(sq.)     DF    Variance      F     5% Point of F

$R_0 = \Sigma(y - \bar{y})^2$          428.512    163
$(5,1) \times (6,K)^2$                 245.412      1     245.4       217        3.90
$R_1 = \Sigma(y - Y_1)^2$              183.100    162     1.130

If the method of procedure up to this point has been thoroughly understood it
will be found that the fitting of additional statistics can be carried forward without
difficulty. The work involved in fitting to the third degree in the present example
has been performed in Table 71. Note that the columns 0, 1, and 2 can be copied
directly from previous calculations and that column K can be copied as far as line 11.

The analysis of variance indicates that the variance due to the additional degree 
of freedom used in fitting a polynomial of the third degree is insignificant. It is, in 
fact, less than the variance due to error of regression. The conclusion is that the
third-degree curve, although it fits the data satisfactorily, is less useful for predicting
loaf volume from protein than the second-degree curve. In making use of another
degree of freedom to determine a new regression function, precision has actually been
lost.
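The successive tests of Example 44 can be reproduced from the sums in the summary of coefficients alone. The sketch below is an illustration under stated assumptions: it solves each set of normal equations by plain elimination in exact fractions rather than by the tabular method, the function names are hypothetical, and it uses the standard least-squares identity that the sum of squares due to regression equals $\Sigma c_k \Sigma(x^k y) - (\Sigma y)^2/n'$. At each stage it prints the residual sum of squares and the variance ratio for the additional degree of freedom.

```python
from fractions import Fraction

# Sums for the coded data of Table 67 (n' = 164).
n = 164
power_sums = [164, 851, 5181, 34991, 253557, 1930751, 15245301]   # S(x^0) .. S(x^6)
moment_sums = [1014, 5695, 36197, 250489]          # S(y), S(xy), S(x^2 y), S(x^3 y)
total_ss = Fraction("428.512")                     # R0 = S(y - ybar)^2


def coefficients(degree):
    """Solve the normal equations for a polynomial of the given degree."""
    size = degree + 1
    rows = [[Fraction(power_sums[i + j]) for j in range(size)]
            + [Fraction(moment_sums[i])] for i in range(size)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(size):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


def residual_ss(degree):
    """Residual sum of squares after fitting up to the given degree."""
    c = coefficients(degree)
    explained = (sum(ck * s for ck, s in zip(c, moment_sums))
                 - Fraction(moment_sums[0] ** 2, n))
    return total_ss - explained


residuals = {0: total_ss}
f_ratios = {}
for degree in (1, 2, 3):
    residuals[degree] = residual_ss(degree)
    gain = residuals[degree - 1] - residuals[degree]        # 1 DF at each stage
    error_variance = residuals[degree] / (n - degree - 1)
    f_ratios[degree] = float(gain / error_variance)
    print(degree, round(float(residuals[degree]), 3), round(f_ratios[degree], 2))
```

Each ratio is judged against the tabulated point of F for 1 and $n' - m - 1$ degrees of freedom, where $m$ is the degree just fitted; fitting stops at the last stage that shows a significant gain.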
TABLE 71

SOLUTION OF NORMAL EQUATIONS FOR FITTING A THIRD-DEGREE POLYNOMIAL

                                                          S(sq.)     DF    Variance    F    5% Point

$R_2$                                                     137.235    161
$(16,3) \times (17,K)^2 = 24{,}197.4 \times (0.003528)^2$   0.301      1     0.3010
$R_3$                                                     136.934    160     0.8558

5. Fitting Logarithmic Curves. The procedure is best illustrated by
means of an example.

Example 45. The data given in Table 72, and presented graphically in Fig. 13,
were obtained in a study by Geddes (4) of the effect of time of heating on the baking
quality of wheat flour.

Fig. 13. Relation between time of heating and baking quality of wheat flour.

From an examination of Fig. 13 it is obvious that a straight line cannot give a good
fit to the results. It is also obvious by inspection that a polynomial cannot be
expected to give a good fit as the curve tends to flatten out and run parallel to the
zero axes at both ends. From x = 0 to x = 4, the curve might be fitted fairly well
by a second-degree polynomial, but as x increases from that point, the curve flattens
out and runs almost parallel to the x axis. This is typical of logarithmic curves and
decidedly not typical of polynomials. We decide therefore that a logarithmic curve
will give the best fit.

TABLE 72

INFLUENCE OF THE TIME OF HEATING AT 1V0° F ON THE BAKING QUALITY OF
STRAIGHT GRADE FLOUR

                      Baking Quality
Time in Hours     Single Figure Estimate

    0.25                  93
    0.50                  71
    0.75                  63
    1.0                   54
    1.5                   43
    2.0                   38
    3.0                   29
    4.0                   26
    6.0                   22
    8.0                   20

The next step is to examine the three principal types of logarithmic curves, as
given in Section 3, and make a preliminary determination of their goodness of fit to
the results by plotting the three pairs of variables, y and log x, log y and x, log y and
log x, against each other in a rough graph and noting which of the three give points
that fall most nearly in a straight line. As illustrated in Fig. 14, the set of points
falling most nearly in a straight line are those given by log y and log x, so we proceed
to fit a curve of the type $\log Y = c_0 + c_1 \log x$.

The calculations, using log y and log x as variables, are exactly the same as
in fitting a straight line. These are given in Tables 73 and 74, together with the
analysis of variance to determine the significance of the fit of the regression line.



Fig. 14.— Result of preliminary test to determine the logarithmic equation giving 
the best fit to the data of Table 72. 

Note that the goodness of fit is determined on the basis of the logarithms of y and Y,
and not on the basis of the actual values. Thus the error of regression is given by
$\Sigma(\log y - \log Y)^2$. This can be taken as a general rule, i.e., that when the regression
equation gives logarithmic values, the test of goodness of fit must be in terms of the
logarithms estimated. It arises from the fact that logarithms express the relative
differences between numbers and not their absolute differences. With two numbers
such as a and b, their absolute difference is a - b, but log a - log b is log a/b, and if a
and b are variables and a given percentage increase in a results in a similar percentage 
increase in b, log a/b is constant and the relation between the logarithms can be ex- 
pressed by a straight-line equation. To test this fact it is essential that we deal with 
logarithms throughout and not with actual values. 
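The fit of Example 45 can be checked with a short calculation on the logarithms. The sketch below is only an illustration: variable names are assumptions, common (base 10) logarithms are used, and full-precision logarithms replace the four-figure tables of the text, so the last decimal may differ slightly from Tables 73 and 74. It fits $\log Y = c_0 + c_1 \log x$ to the data of Table 72 and forms the analysis of variance on the logarithms, as the rule above requires.

```python
import math

# Data of Table 72: time of heating (hours) and baking quality.
hours = [0.25, 0.50, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0]
quality = [93, 71, 63, 54, 43, 38, 29, 26, 22, 20]

log_x = [math.log10(v) for v in hours]
log_y = [math.log10(v) for v in quality]
n = len(log_x)

mean_x = sum(log_x) / n
mean_y = sum(log_y) / n
sxx = sum((a - mean_x) ** 2 for a in log_x)
syy = sum((b - mean_y) ** 2 for b in log_y)          # R0, in terms of logarithms
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))

c1 = sxy / sxx
c0 = mean_y - c1 * mean_x

regression_ss = c1 * sxy                             # 1 degree of freedom
residual_ss = syy - regression_ss                    # R1, n - 2 degrees of freedom
f_ratio = regression_ss / (residual_ss / (n - 2))

# Fitted values converted back to the original scale for plotting.
fitted = [10 ** (c0 + c1 * a) for a in log_x]

print(round(c0, 4), round(c1, 4), round(f_ratio))
print([round(v, 1) for v in fitted])
```

The antilogarithms in `fitted` are used only for drawing the curve; the test of significance is made entirely on `syy`, `regression_ss` and `residual_ss`, which are sums of squares of logarithms.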

For graphical purposes it is suitable to express the results of fitting a logarithmic 
equation as in Fig. 15, where the actual values of x are plotted against the anti- 
logarithms of log Y, and a smooth curve drawn through the points. The small
circles in Fig. 15 represent the original values of y and x.

Fig. 15. Result of fitting the curve $\log Y = c_0 + c_1 \log x$ to the data of Table 72.


TABLE 73

CALCULATION OF COEFFICIENTS FOR THE CURVE $\log Y = c_0 + c_1 \log x$

     x             y             x_1          y_1
  Time in       Baking        = Log x      = Log y       Log Y        Y
   Hours        Quality

   0.25           93          -0.6021       1.9685       1.9937      98.6
   0.50           71          -0.3010       1.8513       1.8528      71.2
   0.75           63          -0.1249       1.7993       1.7704      58.9
   1.0            54           0.0000       1.7324       1.7120      51.5
   1.5            43           0.1761       1.6335       1.6296      42.6
   2.0            38           0.3010       1.5798       1.5711      37.2
   3.0            29           0.4771       1.4624       1.4887      30.8
   4.0            26           0.6021       1.4150       1.4302      26.9
   6.0            22           0.7782       1.3424       1.3478      22.3
   8.0            20           0.9031       1.3010       1.2893      19.5

$\Sigma(x_1) = 2.209{,}600$          $\Sigma(y_1)^* = 6.085{,}600$
$\Sigma(x_1^2) = 2.601{,}671$        $\Sigma(x_1 y_1) = 0.355{,}642{,}8$

$\Sigma(y_1^2) = 4.169{,}362$
$[\Sigma(y_1)]^2/10 = 3.703{,}453$
$\Sigma(y_1 - \bar{y}_1)^2 = 0.465{,}909 = R_0$

* $y_1$ coded by subtracting 1.

TABLE 74

CALCULATION OF STATISTICS AND TEST OF GOODNESS OF FIT FOR THE CURVE
$\log Y = c_0 + c_1 \log x$

Line          0             1               K              Sum

 1           10          2.2096          6.0856          18.2952
 2          -1.0        -0.22096        -0.60856         -1.82952
 3                       2.601,671       0.355,643        5.16691
 4                      -0.48823        -1.344,674       -4.04251
 5                       2.11344        -0.989,031        1.12440
 6                      -1.0000         +0.46797         -0.53203

Reverse
$c_1 = -0.46797$
$c_0 = +0.60856 + 0.10340 = +0.71196$

Check line:  7.1196 - 1.0340 = 6.0856

                              S(sq.)       DF    Variance       F      1% Point

$R_0$                        0.465,909      9
$(5,1) \times (6,K)^2$       0.462,838      1    0.4628        1205      11.26
$R_1$                        0.003,071      8    0.000,384

Equation: $\log Y = 1.71196 - 0.46797 \log x$


6. Fisher’s Summation Metliod of Fitting, Polynomials. When the 
y values arc, or can hi' as'^iuiietl to Lc, of (Mjual ight and arc given for 
equal intervals* of j*, the method of filling ]>» lyiioiuials develoix^d by 
R. A. Fisher provides a MTy decuKd short (ut lioin the actuid to the 
theoretical polynomial values ''I'he luithiiictical labor is likewise easy 
as it consists largely of a proce.'i.s of continuous {5iunuiali(,ii The pro- 
cedure will be illustrated by an (‘xuinplc. 

A summary of foimulae for fitting polyiioniuds i'^ given below, and m 
Tables 79, 80, and 81 tlK* constant faciors in the formulae have been 
calculated for w = 5 to 20 and r 0 to 0, wiiere r K'presents the degree 
of fitting. 

* Professor Fisher has now developed this method fur application to the case 
wherein the y values are of um^quaJ weight 8ee the refeieuces at the end of this 

chapter. 
Summary of Formulae for Fitting Polynomials by the Summation Method

1. $S_1, S_2, S_3, S_4, S_5, \ldots, S_{r+1}$ (by summation).

2. Constants $a, b, c, \ldots$ and $a', b', c', \ldots$:

$$a = \frac{1}{n}\,S_1 \qquad\qquad a' = a$$

$$b = \frac{1\cdot 2}{n(n+1)}\,S_2 \qquad\qquad b' = a - b$$

$$c = \frac{1\cdot 2\cdot 3}{n(n+1)(n+2)}\,S_3 \qquad\qquad c' = a - 3b + 2c$$

$$d = \frac{1\cdot 2\cdot 3\cdot 4}{n(n+1)\cdots(n+3)}\,S_4 \qquad\qquad d' = a - 6b + 10c - 5d$$

$$e = \frac{120}{n(n+1)\cdots(n+4)}\,S_5 \qquad\qquad e' = a - 10b + 30c - 35d + 14e$$

$$f = \frac{720}{n(n+1)\cdots(n+5)}\,S_6 \qquad\qquad f' = a - 15b + 70c - 140d + 126e - 42f$$

The general divisor is

$$\frac{n(n+1)\cdots(n+r)}{1\cdot 2\cdot 3\cdots(r+1)}$$

and the rule for the formation of the coefficients in $a', b', c', \ldots$ is to
multiply successively by

$$\frac{r(r+1)}{1\cdot 2},\qquad \frac{(r-1)(r+2)}{2\cdot 3},\qquad \frac{(r-2)(r+3)}{3\cdot 4},$$

and so on, with alternating signs, until the series terminates.

3. $Y_1$ and the differences:

$$Y_1 = a' + 3b' + 5c' + 7d' + 9e' + 11f'$$

$$D^1Y_1 = -\frac{2\cdot 3}{(n-1)}\,(b' + 5c' + 14d' + 30e' + 55f')$$

$$D^2Y_1 = +\frac{3\cdot 4\cdot 5}{(n-1)(n-2)}\,(c' + 7d' + 27e' + 77f')$$

$$D^3Y_1 = -\frac{4\cdot 5\cdot 6\cdot 7}{(n-1)(n-2)(n-3)}\,(d' + 9e' + 44f')$$

$$D^4Y_1 = +\frac{5\cdot 6\cdot 7\cdot 8\cdot 9}{(n-1)(n-2)\cdots(n-4)}\,(e' + 11f')$$

$$D^5Y_1 = -\frac{332{,}640}{(n-1)(n-2)\cdots(n-5)}\,f'$$

Each formula is seen to be composed of two parts that are best cal-
culated separately. For the component on the right Fisher gives the
coefficients for fitting curves to the tenth degree. They are reproduced
here for fitting up to the fifth degree. The factors on the left are of
alternate positive and negative signs and in generalised form are as
follows:

$$-\frac{2\cdot 3}{n-1},\qquad \frac{3\cdot 4\cdot 5}{(n-1)(n-2)},\qquad -\frac{4\cdot 5\cdot 6\cdot 7}{(n-1)(n-2)(n-3)},\qquad \ldots,\qquad \pm\frac{(r+1)(r+2)\cdots(2r+1)}{(n-1)(n-2)\cdots(n-r)}$$

4. Polynomial values $Y_1$, $Y_2$, etc., by process of summation.*
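
The constant factors in the summary are simple functions of n and r, so they can be regenerated rather than looked up. The following Python sketch is only an illustration: the function names `divisor`, `ss_factor`, and `diff_factor` are hypothetical labels chosen here, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from math import comb, prod


def divisor(n, r):
    """Divisor of S_(r+1) giving a, b, c, ...: n(n+1)...(n+r) / (r+1)!."""
    return comb(n + r, r + 1)


def ss_factor(n, r):
    """(2r+1) n(n+1)...(n+r) / ((n-1)(n-2)...(n-r)); requires r < n."""
    numerator = (2 * r + 1) * prod(range(n, n + r + 1))
    return Fraction(numerator, prod(range(n - r, n)))


def diff_factor(n, r):
    """Signed factor of the r-th difference:
    (-1)^r (r+1)(r+2)...(2r+1) / ((n-1)(n-2)...(n-r)); requires r < n."""
    numerator = prod(range(r + 1, 2 * r + 2))
    return (-1) ** r * Fraction(numerator, prod(range(n - r, n)))


# Factors for n = 9 (the size of the worked example), degrees 0 to 5.
for r in range(6):
    print(r, divisor(9, r), float(ss_factor(9, r)), float(diff_factor(9, r)))
```

For n = 9 and r = 1 the three functions give 45, 33.75, and -0.75, the values used in the worked example that follows.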

Example 46. The y values in Table 75 represent the percentages of cars of
smutty wheat graded at Winnipeg, Manitoba, for the years 1925 to 1933 (6). The
x values are therefore years and can be replaced by the numerals 1 to 9. We shall
use these data in order to show the procedure of fitting a curve of the fifth degree.
Such a curve would probably be of very little practical value for analyzing data of
this kind but it is quite suitable as a numerical example. Summing the y values
from top to bottom we write down the sum showing on the machine after each value
is added. This process is repeated in succeeding columns, the sums of the columns
being designated $S_1$, $S_2$, etc., and if we are fitting a curve to the fifth degree we must
go as far as $S_6$. At this point the summations must be very carefully checked.
This is accomplished simply by adding all the columns and noting that the last
figure in any one column must correspond with the sum of the column on the left.

The second step is to calculate values that are denoted by the letters a, b, c, d, e,
f, and from these obtain a', b', c', d', e', and f'. The formulae for these calculations
are given in the summary of formulae above. In our example we have



a =   53.1/9    = 5.900,000        a' =  5.900,000
b =  253.3/45   = 5.628,889        b' =  0.271,111
c =  790.8/165  = 4.792,727        c' = -1.401,213
d = 2020.8/495  = 4.082,424        d' = -0.358,184
e = 4577.5/1287 = 3.556,721        e' =  0.302,174
f = 9543.6/3003 = 3.178,022        f' =  0.088,117


The third step is the calculation of $Y_1$, the polynomial value of y corresponding
to x = 9, and five other values known as the first, second, third, fourth, and fifth
differences. From $Y_1$ and the differences represented by the symbols

$D^1Y_1$, $D^2Y_1$, $D^3Y_1$, $D^4Y_1$, $D^5Y_1$

the polynomial values are built up by a process of summation as illustrated in
Table 76. For $Y_1$ and the differences we get

$Y_1    =   1.000,000 \times 0.888,833  =   0.888,833$

$D^1Y_1 =  -0.750,000 \times 2.162,126  =  -1.621,594$

$D^2Y_1 =   1.071,428 \times 11.035,206 =  11.823,429$

$D^3Y_1 =  -2.500,000 \times 6.238,530  = -15.596,325$

$D^4Y_1 =   9.000,000 \times 1.271,461  =  11.443,149$

$D^5Y_1 = -49.500,000 \times 0.088,117  =  -4.361,792$

* If necessary the actual equation may be written. Details of the calculations are
given by Snedecor in "Statistical Methods."
The summation process as illustrated in Table 76 is started in the lower right-
hand corner. Beginning with $D^4Y_1$ we add successively the value of $D^5Y_1$. The
other columns are then built up merely by starting with the first figure at the bottom
and adding the figures in the same row in the column to the right. The values in
the last column on the left are the calculated polynomial values of y. Note that in
the second column only five values are required but we require one more in each
column as we proceed to the left, and also that if only two decimal places are required
for the polynomial values the number of decimal places is reduced by one for each
column after the second. A final check on all the work following the calculation
of $S_1, S_2, \ldots, S_6$ is to add the last column. This should give us $S_1$, the total for all the
values of y.
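
The steps just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the book's own worksheet: the function name `summation_fit` is hypothetical, exact fractions stand in for the calculating machine, the coefficients of a', b', ... are hard-coded from the summary of formulae (so the sketch stops at the fifth degree), and, as in the example, $Y_1$ is taken to be the fitted value at the last x.

```python
from fractions import Fraction
from math import comb, prod

# Coefficients of a', b', ... in Y1 and in the first to fifth differences.
COEF = [[1, 3, 5, 7, 9, 11], [1, 5, 14, 30, 55], [1, 7, 27, 77],
        [1, 9, 44], [1, 11], [1]]


def summation_fit(y, degree):
    """Fitted values for equally spaced, equally weighted y.
    Assumes degree <= 5 and degree < len(y)."""
    n = len(y)
    col = [Fraction(str(v)) for v in y]
    # Step 1: S1, S2, ... are the totals of successive running-sum columns.
    S = []
    for _ in range(degree + 1):
        S.append(sum(col))
        total, running = Fraction(0), []
        for v in col:
            total += v
            running.append(total)
        col = running
    # Step 2: a, b, c, ... and then a', b', c', ... by the multiplication rule.
    plain = [S[r] / comb(n + r, r + 1) for r in range(degree + 1)]
    primed = []
    for r in range(degree + 1):
        coef, acc = Fraction(1), Fraction(0)
        for j in range(r + 1):
            acc += coef * plain[j]
            coef *= -Fraction((r - j) * (r + j + 1), (j + 1) * (j + 2))
        primed.append(acc)
    # Step 3: Y1 and the differences, each a signed factor times a weighted sum.
    state = []
    for k in range(degree + 1):
        factor = (-1) ** k * Fraction(prod(range(k + 1, 2 * k + 2)),
                                      prod(range(n - k, n)))
        state.append(factor * sum(COEF[k][j - k] * primed[j]
                                  for j in range(k, degree + 1)))
    # Step 4: build the values upward from the last x by repeated addition.
    fitted = [state[0]]
    for _ in range(n - 1):
        state = [state[i] + state[i + 1] for i in range(degree)] + [state[degree]]
        fitted.append(state[0])
    return fitted[::-1]


y = [2.2, 1.2, 2.6, 5.5, 16.5, 17.0, 6.5, 1.1, 0.5]  # y values of Example 46
fit5 = summation_fit(y, 5)
print([round(float(v), 2) for v in fit5])
```

Because the sketch carries exact fractions, its values may differ from the hand calculation in the second decimal place, where the worked example rounds its intermediate figures.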

The summation method is particularly well adapted to fitting by successive
stages and to the application of the analysis of variance at each stage. Assuming at
the outset that fitting will probably be carried to the fifth degree we first calculate
$S_1, S_2, \ldots, S_6$ as in Table 75 and the constants a', b', c', d', e', f'. For each stage of
fitting we require only $Y_1$ and the corresponding differences. If desirable we can
determine the significance of each degree of freedom used in fitting before we go to
the trouble of actually calculating the polynomial values and in this way save our-
selves the labor of calculations that are not going to be of any value. The formulae
for the sums of squares represented by each additional degree of freedom used in
fitting are as follows:

Degree of
Fitting (r)     Sum of Squares

0               $S_1^2/n = n\,a'^2$   (Represents fitting of the mean)

1               $\Sigma(\bar{y} - Y_1)^2 = 3\,\dfrac{n(n+1)}{(n-1)}\,b'^2$

2               $\Sigma(Y_1 - Y_2)^2 = 5\,\dfrac{n(n+1)(n+2)}{(n-1)(n-2)}\,c'^2$

3               $\Sigma(Y_2 - Y_3)^2 = 7\,\dfrac{n(n+1)\cdots(n+3)}{(n-1)(n-2)(n-3)}\,d'^2$

4               $\Sigma(Y_3 - Y_4)^2 = 9\,\dfrac{n(n+1)\cdots(n+4)}{(n-1)(n-2)\cdots(n-4)}\,e'^2$

5               $\Sigma(Y_4 - Y_5)^2 = 11\,\dfrac{n(n+1)\cdots(n+5)}{(n-1)(n-2)\cdots(n-5)}\,f'^2$

r               $\Sigma(Y_{r-1} - Y_r)^2 = (2r+1)\,\dfrac{n(n+1)\cdots(n+r)}{(n-1)(n-2)\cdots(n-r)}\,(\text{constant})^2$


For the example that has already been fitted to the fifth degree the sums of squares
and corresponding analyses of variance are given in Table 77. After fitting to the
second degree there is no further gain in precision; consequently in actual practice
we would proceed direct to the calculation of the polynomial values for a second-
degree curve. This calculation is given at the foot of Table 77.
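
To make the stage-by-stage test concrete, the following Python sketch recomputes the sums of squares for Example 46 from the primed constants. It is an illustration under assumptions: the helper names are hypothetical, and each F is taken as the sum of squares for the newly fitted degree divided by the error variance remaining after that stage.

```python
from fractions import Fraction
from math import comb, prod


def primed_constants(y, degree):
    """a', b', c', ... obtained from successive summation of y."""
    n = len(y)
    col = [Fraction(str(v)) for v in y]
    plain = []
    for r in range(degree + 1):
        plain.append(sum(col) / comb(n + r, r + 1))
        total, running = Fraction(0), []
        for v in col:
            total += v
            running.append(total)
        col = running
    primed = []
    for r in range(degree + 1):
        coef, acc = Fraction(1), Fraction(0)
        for j in range(r + 1):
            acc += coef * plain[j]
            coef *= -Fraction((r - j) * (r + j + 1), (j + 1) * (j + 2))
        primed.append(acc)
    return primed


def stage_analysis(y, degree):
    """Rows (r, regression SS, error SS, error DF, F) for r = 1..degree."""
    n = len(y)
    primed = primed_constants(y, degree)

    def ss(r):
        # (2r+1) n(n+1)...(n+r) / ((n-1)...(n-r)) times the squared constant
        factor = Fraction((2 * r + 1) * prod(range(n, n + r + 1)),
                          prod(range(n - r, n)))
        return factor * primed[r] ** 2

    error = sum(Fraction(str(v)) ** 2 for v in y) - ss(0)  # total about the mean
    rows = []
    for r in range(1, degree + 1):
        error -= ss(r)
        df = n - 1 - r
        rows.append((r, float(ss(r)), float(error), df,
                     float(ss(r) / (error / df))))
    return rows


y = [2.2, 1.2, 2.6, 5.5, 16.5, 17.0, 6.5, 1.1, 0.5]  # y values of Example 46
rows = stage_analysis(y, 5)
for row in rows:
    print(row)
```

Only the second-degree term should give an F above the 5% point, which is the basis for stopping at a second-degree curve.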


TABLE 75

Calculation of S1, S2, S3, S4, S5, and S6 for Fitting a Polynomial of the
Fifth Degree by the Summation Method

                       Successive summations
  x       y        (1)       (2)       (3)       (4)       (5)

  1      2.2       2.2       2.2       2.2       2.2       2.2
  2      1.2       3.4       5.6       7.8      10.0      12.2
  3      2.6       6.0      11.6      19.4      29.4      41.6
  4      5.5      11.5      23.1      42.5      71.9     113.5
  5     16.5      28.0      51.1      93.6     165.5     279.0
  6     17.0      45.0      96.1     189.7     355.2     634.2
  7      6.5      51.5     147.6     337.3     692.5    1326.7
  8      1.1      52.6     200.2     537.5    1230.0    2556.7
  9      0.5      53.1     253.3     790.8    2020.8    4577.5

Sum     53.1     253.3     790.8    2020.8    4577.5    9543.6
        = S1      = S2      = S3      = S4      = S5      = S6


TABLE 76

Calculation of Polynomial Values

  x       Y          D1            D2             D3             D4

  1      2.34
  2      0.87       1.467
  3      2.06      -1.190        2.656,9
  4      7.91      -5.845        4.655,4       -1.998,50
  5     14.40      -6.495        0.649,9        4.005,52      -6.004,019
  6     15.90      -1.497       -4.997,9        5.647,75      -1.642,227
  7      9.47       6.429       -7.926,1        2.928,18       2.719,565
  8     -0.73      10.202       -3.772,9       -4.153,18       7.081,357
  9      0.889     -1.6216      11.823,43     -15.596,325     11.443,149


Example 47. The whole process of fitting by successive stages may be carried
out in tabular form as in Table 78. The data are for the relation between pH and
the activity of the enzyme asparaginase (6). Note that three columns are required
for fitting to the first degree and thereafter each additional column provides the
data for fitting one additional constant. Lines 14 and 15 determine the degree to
which the curve should be fitted. In the example it is obvious that the fitting should
be carried to the fourth degree; consequently, the remainder of the work applies to
a fourth-degree curve only.
TABLE 77

Analyses of Variance: Significance of Degrees of Freedom Used in
Fitting to the Fifth Degree in Successive Stages

Degree of                    Sums of    Degrees of
Fitting                      Squares    Freedom     Variance     F      5% Point

            Total            334.96        8

    1       Regression         2.48        1           2.48
            Error            332.48        7          47.50

    2       Regression       173.55        1         173.6      6.55      5.99
            Error            158.93        6          26.49

    3       Regression        31.75        1          31.75     1.25      6.61
            Error            127.18        5          25.44

    4       Regression        75.54        1          75.54     5.85      7.71
            Error             51.64        4          12.91

    5       Regression        27.48        1          27.48     3.41     10.13
            Error             24.16        3           8.05

Calculation of Polynomial Values for the Second-Degree Curve

$Y_1    =  1.000,000 \times (-0.292,732) = -0.292,732$

$D^1Y_1 = -0.750,000 \times (-6.734,954) =  5.051,215$

$D^2Y_1 =  1.071,428 \times (-1.401,213) = -1.501,299$

                                     Polynomial
  x        D2           D1            Values

  1                                    -1.92
  2                    -5.458           3.54
  3                    -3.957           7.50
  4                    -2.455           9.95
  5                    -0.954          10.90
  6                     0.547          10.36
  7                     2.049           8.31
  8                     3.550           4.76
  9     -1.501,30       5.0512         -0.293

                        Total =         53.1




TABLE 78

Complete Tabular Method for Fitting by the Summation Method in Successive Stages

(The individual y values and the running-sum columns above line 1 are not
legible in this copy; only the column totals and the derived lines are shown.)

                                  Degree of Fitting (r)
Line       0             1             2             3             4             5

  1       68.9         496.1        2229.0        7852.6       23603.7       63218.6
  2       14           105           560          2380          8568         27,132
  3     4.921,428     4.724,762     3.980,357     3.299,412     2.754,867     2.330,038
  4    +4.921,428    +0.196,666    -1.292,144    -0.120,634    +0.173,236    +0.008,954
  5    24.220,454     0.038,677,5   1.669,636     0.014,552,6   0.030,010,7   0.000,080,17
  6       14.0         48.4615      107.6923      233.007       539.245      1,391.38
  7      339.0864       1.8744      179.8069        3.3909       16.1831        0.1115
  8      541.7100
  9      339.0864       1.8744      179.8069        3.3909       16.1831        0.1115
 10      202.6236     200.7492       20.9423       17.5514        1.3683        1.2568
 11                    12            11            10             9             8
 12                     1.874       179.8           3.391        16.18          0.1115
 13                    16.73          1.904         1.755         0.1520        0.1571
 14                                  94.4           1.93        106.4
 15                                   4.84          4.96          5.12
 16    -0.234,608    -2.755,850    +2.540,790    +1.438,490    +0.173,236
 17    +1.0          -0.461,538    +0.384,615    -0.489,510    +0.881,119
 18    -0.234,608    +1.271,929    +0.977,226    -0.704,155    +0.152,642
(1) $S_{r+1}$ values. The y values entered as columns are summated.

(2) Divisor for $S_{r+1}$ values, taken from Table 79.

(3) Division of line 1 by line 2 gives the constants a, b, c, d, . . . .

(4) The constants a', b', c', d', . . . are calculated from a, b, c, d, . . . as indicated in
the summary of formulae.

(5) Squares of a', b', c', d', . . . .

(6) Factor taken from Table 80.

(7) Line 5 multiplied by line 6 gives the sum of squares $\Sigma(Y_{r-1} - Y_r)^2$ repre-
sented by 1 DF. For each DF utilized in fitting this is the reduction in the sum of
squares due to error of regression.

(8) $\Sigma(y^2)$. Enter in first column.

(9) Repeat $\Sigma(Y_{r-1} - Y_r)^2$ values.

(10) Subtracting 9 from 8 in the first column gives the remainder in line 10. Then
subtract the values in line 9 successively, putting down the remainders in line 10.

(11) The DF for error of regression are entered here. The DF for the sums of
squares in line 9 is 1 in each case so that they do not need to be entered.

(12) Line 9 repeated, reducing to 4-figure accuracy.

(13) Line 10 divided by line 11.

(14) $F = v_1/v_2$.

(15) Enter 5% points from Table 96.

(16) Calculate as in section 3 of summary of formulae.

(17) Enter factors from Table 81.

(18) Line 16 multiplied by line 17.


Calculation of Polynomial Values for Fourth-Degree Curve

  x        Y             D1             D2             D3

  1      0.2748
  2      0.1691       0.105,651
  3      1.6903      -1.521,180      1.626,831
  4      4.0161      -2.325,746      0.804,566      0.822,265
  5      6.4768      -2.460,689      0.134,943      0.669,623
  6      8.5554      -2.078,651     -0.382,038      0.516,981
  7      9.8877      -1.332,274     -0.746,377      0.364,339
  8     10.2619      -0.374,200     -0.958,074      0.211,697
  9      9.6190       0.642,929     -1.017,129      0.059,055
 10      8.0525       1.566,471     -0.923,542     -0.093,587
 11      5.8087       2.243,784     -0.677,313     -0.246,229
 12      3.2865       2.522,226     -0.278,442     -0.398,871
 13      1.0373       2.249,155      0.273,071     -0.551,513
 14     -0.234,608    1.271,929      0.977,226     -0.704,155


7. Exercises.

1. Calculate the correlation ratio for the data of Table 67, Chapter XIII, and by
means of the analysis of variance test the significance of the variance for the means
of the arrays.
TABLE 79

$\dfrac{n(n+1)\cdots(n+r)}{1\cdot 2\cdots(r+1)}$ for Use in Calculation of a, b, c, d, e, f

                              Degree of Fitting (r)
  n      0      1       2        3         4          5          6

  5      5     15      35       70       126        210        330
  6      6     21      56      126       252        462        792
  7      7     28      84      210       462        924      1,716
  8      8     36     120      330       792      1,716      3,432
  9      9     45     165      495     1,287      3,003      6,435
 10     10     55     220      715     2,002      5,005     11,440
 11     11     66     286    1,001     3,003      8,008     19,448
 12     12     78     364    1,365     4,368     12,376     31,824
 13     13     91     455    1,820     6,188     18,564     50,388
 14     14    105     560    2,380     8,568     27,132     77,520
 15     15    120     680    3,060    11,628     38,760    116,280
 16     16    136     816    3,876    15,504     54,264    170,544
 17     17    153     969    4,845    20,349     74,613    245,157
 18     18    171   1,140    5,985    26,334    100,947    346,104
 19     19    190   1,330    7,315    33,649    134,596    480,700
 20     20    210   1,540    8,855    42,504    177,100    657,800


TABLE 80

$(2r+1)\,\dfrac{n(n+1)(n+2)\cdots(n+r)}{(n-1)(n-2)\cdots(n-r)}$ for Calculation of Sums of Squares

                                 Degree of Fitting (r)
  n      0        1           2          3           4            5             6

  5     5.0     22.5000     87.5000    490.000    5670.000
  6     6.0     25.2000     84.0000    352.800    2268.000    30,492.00
  7     7.0     28.0000     84.0000    294.000    1386.000    10,164.00    156,156.00
  8     8.0     30.8571     85.7143    264.000    1018.286     5,393.14     44,616.00
  9     9.0     33.7500     88.3929    247.500     827.357     3,539.25     20,913.75
 10    10.0     36.6667     91.6667    238.333     715.000     2,621.67     12,393.33
 11    11.0     39.6000     95.3333    233.567     643.500     2,097.33      8,427.47
 12    12.0     42.5455     99.2727    231.636     595.636     1,768.00      6,268.36
 13    13.0     45.5000    103.4091    231.636     562.545     1,547.00      4,962.45
 14    14.0     48.4615    107.6923    233.007     539.245     1,391.38      4,110.91
 15    15.0     51.4286    112.0879    235.385     522.737     1,277.80      3,523.64
 16    16.0     54.4000    116.5714    238.523     511.121     1,192.62      3,100.80
 17    17.0     57.3750    121.1250    242.250     503.135     1,127.39      2,785.88
 18    18.0     60.3529    125.7353    246.441     497.912     1,076.68      2,544.88
 19    19.0     63.3333    130.3922    251.005     494.838     1,036.80      2,356.37
 20    20.0     66.3158    135.0877    255.872     493.467     1,005.21      2,206.24
TABLE 81

$\pm\dfrac{(r+1)(r+2)\cdots(2r+1)}{(n-1)(n-2)\cdots(n-r)}$ for Calculation of $Y_1$ and Differences

                                   Degree of Fitting (r)
  n     0         1              2              3              4              5              6
        +         -              +              -              +              -              +

  5     1     1.5            5.0           35.0          630.0
  6     1     1.2            3.0           14.0          126.0         2772.0
  7     1     1.0            2.0            7.0           42.0          462.0        12012.0
  8     1     0.8571,4286    1.4285,7143    4.0           18.0          132.0         1716.0
  9     1     0.75           1.0714,2857    2.5            9.0           49.5          429.0
 10     1     0.6666,6667    0.8333,3333    1.6666,6667    5.0           22.0          143.0
 11     1     0.60           0.6666,6667    1.1666,6667    3.0           11.0           57.2
 12     1     0.5454,5455    0.5454,5455    0.8484,8485    1.9090,9091    6.0           26.0
 13     1     0.50           0.4545,4545    0.6363,6364    1.2727,2727    3.5           13.0
 14     1     0.4615,3846    0.3846,1538    0.4895,1049    0.8811,1888    2.1538,4615    7.0
 15     1     0.4285,7143    0.3296,7033    0.3846,1538    0.6293,7063    1.3846,1538    4.0
 16     1     0.40           0.2857,1429    0.3076,9231    0.4615,3846    0.9230,7692    2.4
 17     1     0.375          0.25           0.25           0.3461,5385    0.6346,1538    1.5
 18     1     0.3529,4118    0.2205,8824    0.2058,8235    0.2647,0588    0.4479,6380    0.9705,8824
 19     1     0.3333,3333    0.1960,7843    0.1715,6863    0.2058,8235    0.3235,2941    0.6470,5882
 20     1     0.3157,8947    0.1754,3860    0.1444,7884    0.1625,3870    0.2383,9009    0.4427,2446


2. For tht* JAliovsiiig etiijauoiiu, v'jtK'ul'jo' tl.i v iLic. >>J V' li*r j I . 
aiul plot tho ciiiv(‘8 dll giaplt t'iMHi 


(a) 

}' •- rj>. 1- 0 hi a- 

(b) 

1' --- ? ns f t io? I- 

(e) 

!<)(> r -- 0 ‘iPh (- 0 oas ,i 

W) • 

r - 0 .'1'! t- 0 oi.y J<)i' 


Dcbrrilx* the effo<‘r of ihc log:uiil»mic liiiiirtfuriiutl .uii )f (n) o io i.| t.j. 

tioiis (b) and (c) 

3 . Using the data given in 'Fable 85 , deturuiino the lyp-* of Jogarillirii>r “lii \ v ‘liaL 

should be fitted to the data Having soleelcd the tyjie of earve jinA-ccd anli thi‘ 
fitting as ill Tables 71-1 ami 71 . J'^repiro two gniphs, one sliowni*' llu b* ol ihi* 
straight-line loganthiiiir equation to the logaiilhrns of v and an(Xh''r '^h rljt* 

curve for the aotiial vuluos of >’ estimated from the legrosMon e<tuatioii SV 

may be uswi for a similar exercise 

4. Tablets give^ the values of y, V|,*, and Tyj from a eon elation burfnee foi tin- 
area and head hmgth of 600 bull 8iMMiiiAtoz<»ii, Isu (7) The three eolumn*^ are 
similar to the first three columns of Tablf* 68 and provide all the data necerf'^aly 
for calculating polynomial regression equations I*'ind the regression equation thnt 
gives the beat fit to the data. Then raleulaie the Y valued nnd const met a gra)>h 
similar to Fig. 15, showing the means of the arrays and the regression line 
5. Using the data for x and y given below, determine the goodness of fit of curves
up to the sixth degree. Select the curve to which the data should be fitted, and
proceed accordingly to the calculation of the polynomial values. Graph your results.


x     1     2     3     4     5    6    7    8    9   10   11   12   13   14

y  12.6  13.8  14.1  13.9  12.3  7.2  4.8  2.8  2.4  2.1  3.7  6.3  7.8  8.3


6. In economic analysis, methods of curve fitting are very frequently utilised
in order to study secular trend in a time series. Secular trend means the smooth
long-term movement of a series of statistical values and is entirely distinct from
seasonal and cyclical fluctuations. Cyclical fluctuations are not as periodical as the 
seasonal ones but as a general rule have sufficient regularity to show definite swings 
above and below the normal through periods of depression and prosperity. Curve 
fitting may, on the one hand, be used to measure the secular trend of a statistical 
series, and, on the other hand, using the fitted curve as a normal, we can plot the 
deviations from the normal in such a way as to bring out the characteristics of 
cyclical fluctuations. 

Take the data given in Table 84 of the bank clearings in New York City for the 
years 1860 to 1923 and combining them in 4 year groups obtain 16 points to which a 
curve may be fitted. Determine the best-fitting polynomial and graph your results 
on a large sheet of graph paper giving the 16 calculated values and the actual bank 
clearings for individual years. Measure off the deviations of the values for individual 
years from a smooth curve drawn through the 16 calculated points, and graph these 
deviations on another sheet showing them as deviations from a straight horizontal 
line. 

TABLE 82

Heat of Hydration in Calories and Water Imbibed per Gram of Flour

   Cc.             Heat of
Water Imbibed     Hydration

   0.012             2.3
   0.026             6.7
   0.039             7.4
   0.049             9.2
   0.064            10.7
   0.073            12.4
   0.091            14.6
   0.099            15.1
   0.123            16.8
   0.146            17.8
TABLE 83

Data from Correlation Surface for Area (y) and Head Length (x) of
500 Bull Sperms



Frequency 



of y for 

Totals for 


X Arrays 

y Arrays 

y 


Ty. 

1 

2 

6 

2 

0 

24 

3 

7 

63 

4 

7 

247 

5 

14 

618 

6 

12 

939 

7 

22 

1038 

8 

36 

897 

9 

70 

557 

10 

112 

311 

11 

133 

82 

12 

69 

29 

13 

2 

41 

14 

2 

29 

15 

1 

7 


Total - 500 


TABLE 84

Bank Clearings in New York City (1860-1923)
Figures in thousands of millions

1860     7.2      1876    21.6      1892    36.7      1908    79.3
1861     6.9      1877    23.3      1893    31.2      1909   103.6
1862     6.9      1878    19.9      1894    24.4      1910    97.3
1863    14.9      1879    29.2      1895    29.9      1911    92.4
1864    24.1      1880    38.6      1896    28.8      1912   100.7
1865    26.0      1881    49.4      1897    33.4      1913    94.6
1866    28.7      1882    46.9      1898    42.0      1914    83.0
1867    28.7      1883    37.4      1899    60.8      1915   110.6
1868    28.5      1884    31.0      1900    52.7      1916   169.6
1869    37.4      1885    28.2      1901    79.4      1917   177.4
1870    27.8      1886    33.7      1902    76.3      1918   178.6
1871    29.3      1887    33.4      1903    66.0      1919   235.8
1872    33.8      1888    31.1      1904    68.6      1920   243.2
1873    35.5      1889    35.9      1905    93.8      1921   194.4
1874    22.9      1890    37.4      1906   104.7      1922   217.9
1875    25.1      1891    33.7      1907    87.2      1923   214.0
TABLE 85

Moisture Content and Heat of Hydration of Fifth Middlings Flour (6)

                  Heat of
 Per Cent        Hydration
 Moisture       in Calories
   (x)              (y)

    1.7            18.3
    2.9            16.0
    4.2            12.6
    5.6            10.9
    6.6             9.1
    8.1             7.6
    9.0             5.9
   10.8             3.7
   11.5             3.2
   14.0             1.5
   16.3             0.5



CHAPTER XV


THE ANALYSIS OF COVARIANCE

1. The Heterogeneity of Covariation and the Principle of Covariance
Analysis. We have noted from our study of the analysis of variance
that for a single variable the variation is frequently heterogeneous and
may be sorted out into components determined largely by the way in
which the data are taken. The same is true for the correlated vari-
ability or covariation of two variables, and the mechanism for sorting
out the covariance effects is known as the analysis of covariance. In
order to think in terms of actual values, we may suppose that the
variables are yields of grain and straw from field plots. The total
covariance for grain and straw yields is made up in part by the covari-
ance for the means of the treatments and in part by the covariance within
the plots of the same variety. The degree of correlation may be differ-
ent for the two components and hence the total correlation is hetero-
geneous. In the same way we may consider the covariance for the
replicate means as another component. In fact the components may
be taken as exactly equivalent to those according to which the data may
be classified for an analysis of variance of either variable.

2. Division of Sums of Products and Degrees of Freedom. Just as
the analysis of variance arises from the fact that the sums of squares and
degrees of freedom may be subdivided according to the way in which the
data are classified, the analysis of covariance arises from the fact that
the sums of products of the deviations and corresponding degrees of
freedom can be subdivided in the same manner.

Representing a set of data for two variables as follows:

$$\begin{array}{cccc}
x_{11}y_{11} & x_{12}y_{12} & \cdots & x_{1n}y_{1n} \\
x_{21}y_{21} & x_{22}y_{22} & \cdots & x_{2n}y_{2n} \\
\vdots & \vdots & & \vdots \\
x_{k1}y_{k1} & x_{k2}y_{k2} & \cdots & x_{kn}y_{kn}
\end{array}$$

in which there are k groups of n pairs of variates of x and y,

$$(x_{11} - \bar{x}) = (x_{11} - \bar{x}_1) + (\bar{x}_1 - \bar{x})$$

and

$$(y_{11} - \bar{y}) = (y_{11} - \bar{y}_1) + (\bar{y}_1 - \bar{y})$$

Multiplying to obtain a single product of the deviations:

$$(x_{11} - \bar{x})(y_{11} - \bar{y}) = (x_{11} - \bar{x}_1)(y_{11} - \bar{y}_1) + (\bar{x}_1 - \bar{x})(\bar{y}_1 - \bar{y}) + (x_{11} - \bar{x}_1)(\bar{y}_1 - \bar{y}) + (\bar{x}_1 - \bar{x})(y_{11} - \bar{y}_1)$$

On summating for all the pairs in the first group the last two terms dis-
appear and we have:

$$\Sigma(x - \bar{x})(y - \bar{y}) = \Sigma(x - \bar{x}_1)(y - \bar{y}_1) + n(\bar{x}_1 - \bar{x})(\bar{y}_1 - \bar{y})$$

Then summating over the k groups:

$$\Sigma(x - \bar{x})(y - \bar{y}) = \Sigma[\Sigma(x - \bar{x}_g)(y - \bar{y}_g)] + n\,\Sigma(\bar{x}_g - \bar{x})(\bar{y}_g - \bar{y}) \qquad (1)$$

where $\bar{x}_g$ and $\bar{y}_g$ are group means for x and y. This is the fundamental
equation for the sums of products on which the analysis of covariance is
based. If the same data are divided into n classes as well as k groups,
the equations for sums of products and degrees of freedom are, with
$\bar{x}_c$ and $\bar{y}_c$ written for the class means:

$$\Sigma(x - \bar{x})(y - \bar{y}) = \Sigma(x - \bar{x}_g - \bar{x}_c + \bar{x})(y - \bar{y}_g - \bar{y}_c + \bar{y}) + n\,\Sigma(\bar{x}_g - \bar{x})(\bar{y}_g - \bar{y}) + k\,\Sigma(\bar{x}_c - \bar{x})(\bar{y}_c - \bar{y}) \qquad (2)$$

$$(nk - 1) = (n - 1)(k - 1) + (k - 1) + (n - 1) \qquad (3)$$

The method of calculating the sums of products is not according to
these formulae but by means of equalities similar to those used for cal-
culating sums of squares. These equalities are described below under
Example 48.
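
Equation (1) can be checked numerically. The short Python sketch below uses made-up data for k = 3 groups of n = 4 pairs (the numbers and function names are purely illustrative, not taken from any example in the text) and confirms that the total sum of products equals the within-group part plus n times the sum of products of the group means.

```python
from fractions import Fraction


def mean(values):
    """Exact arithmetic mean of a list of integers."""
    return Fraction(sum(values), len(values))


def sum_products(xs, ys, mx, my):
    """Sum of products of the deviations of xs and ys from mx and my."""
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


# Hypothetical data: k = 3 groups, each with n = 4 pairs (x, y).
groups = [
    ([2, 4, 5, 7], [10, 12, 15, 17]),
    ([3, 6, 8, 9], [11, 14, 13, 18]),
    ([1, 2, 6, 7], [9, 8, 12, 15]),
]
n = len(groups[0][0])
all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
grand_x, grand_y = mean(all_x), mean(all_y)

total = sum_products(all_x, all_y, grand_x, grand_y)
within = sum(sum_products(xs, ys, mean(xs), mean(ys)) for xs, ys in groups)
between = n * sum((mean(xs) - grand_x) * (mean(ys) - grand_y)
                  for xs, ys in groups)

print(total, within, between)
```

The degrees of freedom divide in the same way: nk - 1 = 11 for the total, k(n - 1) = 9 within groups, and k - 1 = 2 between groups.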

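3. Coefficients of Correlation Corresponding to Sums of Products
and Squares. Considering the simple classification of the pairs of
variates into k groups of n pairs, we have the sums of products and corre-
sponding sums of squares of x and y as follows:

$$\Sigma(x - \bar{x})(y - \bar{y}) = \Sigma(x - \bar{x}_g)(y - \bar{y}_g) + n\,\Sigma(\bar{x}_g - \bar{x})(\bar{y}_g - \bar{y})$$

$$\Sigma(x - \bar{x})^2 = \Sigma(x - \bar{x}_g)^2 + n\,\Sigma(\bar{x}_g - \bar{x})^2$$

$$\Sigma(y - \bar{y})^2 = \Sigma(y - \bar{y}_g)^2 + n\,\Sigma(\bar{y}_g - \bar{y})^2 \qquad (4)$$

It is now clear that each vertical set represents the factors necessary to
calculate correlation or regression coefficients. Hence we can write:

Total:

$$r_{xy} = \frac{\Sigma(x - \bar{x})(y - \bar{y})}{\sqrt{\Sigma(x - \bar{x})^2\,\Sigma(y - \bar{y})^2}} \qquad b_{yx} = \frac{\Sigma(x - \bar{x})(y - \bar{y})}{\Sigma(x - \bar{x})^2} \qquad DF = nk - 2$$

Within groups:

$$r_{xy} = \frac{\Sigma(x - \bar{x}_g)(y - \bar{y}_g)}{\sqrt{\Sigma(x - \bar{x}_g)^2\,\Sigma(y - \bar{y}_g)^2}} \qquad b_{yx} = \frac{\Sigma(x - \bar{x}_g)(y - \bar{y}_g)}{\Sigma(x - \bar{x}_g)^2} \qquad DF = k(n - 1) - 1$$

Between groups:

$$r_{xy} = \frac{n\,\Sigma(\bar{x}_g - \bar{x})(\bar{y}_g - \bar{y})}{\sqrt{n\,\Sigma(\bar{x}_g - \bar{x})^2\; n\,\Sigma(\bar{y}_g - \bar{y})^2}} \qquad b_{yx} = \frac{n\,\Sigma(\bar{x}_g - \bar{x})(\bar{y}_g - \bar{y})}{n\,\Sigma(\bar{x}_g - \bar{x})^2} \qquad DF = k - 2 \qquad (5)$$

Note that for each component the degrees of freedom for estimating the
coefficients are one less than for the corresponding estimates of the
variance.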

Since it can be proved that the variances and covariances for between
and within groups are unbiassed estimates of the true values for the
population sampled, it follows that the corresponding coefficients of
correlation and regression are also unbiassed estimates of the correla-
tion and regression parameters of the population. They can be used,
therefore, to test the significance of the covariance effects represented by
the various components for which they are calculated. One practical
application of this principle will be seen at once. Total correlation
coefficients are obviously incapable of definite interpretation if they
represent heterogeneous covariance effects, and tests of significance
applied to them cannot give a clear-cut answer. The coefficients
calculated from each component, however, are capable of definite inter-
pretation. In the simple case of covariance within and between groups,
if the total covariance is made up largely of the covariance between the
means of the groups, the total correlation is often referred to as contain-
ing a spurious effect. By the covariance method this effect is taken
care of in the calculation of the covariance between the means and is
completely removed from the covariance within the groups. Thus the
so-called spurious effect is not only removed but completely evaluated as
a distinct component of the total.

4. Applications of the Covariance Method to the Control of Error.

One of the most important applications of the analysis of covariance is
in the control of errors that arise at random throughout the experiment
and cannot be taken care of by replication. In the case, for example, of
number of plants per plot for such crops as mangels and sugar beets,
the variations in number of plants arise at random throughout the
experiment and, so far as they affect the yields of single plots, add to the
experimental error. Correction of the yields on the basis that yield is
directly proportional to the number of plants is a frequent practice, but
it is not difficult to demonstrate that yield is rarely if ever proportional
to the number of plants per plot, and that such an adjustment is likely
to exaggerate the yields of plots in which plants are missing. Correction
on the basis of the exact relation between yield and number of plants as
indicated by the data is, however, perfectly justifiable, and the method of
making such a correction is a natural development of the covariance
technique. Numerous applications of the same method will undoubt-
edly occur to workers in other fields.

In order to demonstrate the control of error by the covariance
method, we shall represent a covariance analysis algebraically as follows,
in which the experiment is presumed to be a randomized block field plot
test:



              DF       $\Sigma(x^2)$   $\Sigma(xy)$   $\Sigma(y^2)$   $b_{yx}$            $b_{yx}\Sigma(xy)$   $\Sigma(y'^2)$       DF

Blocks        $p$      $A_0$    $B_0$    $C_0$
Treatments    $q$      $A_1$    $B_1$    $C_1$    $b_1 = B_1/A_1$    $b_1B_1$    $C_1 - b_1B_1$    $q - 1$
Error         $n$      $A_2$    $B_2$    $C_2$    $b_2 = B_2/A_2$    $b_2B_2$    $C_2 - b_2B_2$    $n - 1$
T + E         $n + q$  $A_t$    $B_t$    $C_t$    $b_t = B_t/A_t$    $b_tB_t$    $C_t - b_tB_t$    $n + q - 1$

In the column headings, x is written for $(x - \bar{x})$, y for $(y - \bar{y})$, $b_{yx}$ for the
regression coefficient of y on x, and $\Sigma(y'^2)$ indicates a sum of squares for
y adjusted by the regression coefficient in the same line.

The calculations are complete in each line of the table. The regres-
sion coefficient is $B/A$, and the adjustment in the sum of squares for y is
$bB$ or $B^2/A$. In the last line we are considering only treatments and
error so that $A_t = A_1 + A_2$, $B_t = B_1 + B_2$, and $C_t = C_1 + C_2$.

The second step in the procedure is indicated as follows:

                  DF            S (sq.)                       Variance

T + E             $n + q - 1$   $C_t - b_tB_t$
E                 $n - 1$       $C_2 - b_2B_2$                $V_2$
T                 $q$           $C_1 + b_2B_2 - b_tB_t$       $V_1$
T                 $q - 1$       $C_1 - b_1B_1$                $V_3$
$(b_1 - b_2)$     $1$           $b_1B_1 + b_2B_2 - b_tB_t$    $V_4$


The first sum of squares for treatments is obtained by difference and,
since it has not been adjusted by the treatment regression coefficient, is
still represented by q degrees of freedom. The second treatment sum of
squares is written down from the first table and is represented by q - 1
degrees of freedom, as it has been adjusted by the treatment coefficient.
On subtracting the second treatment sum of squares from the first, we
have a sum of squares given by $b_1B_1 + b_2B_2 - b_tB_t$, and it is not diffi-
cult to prove the following equality:

$$b_1B_1 + b_2B_2 - b_tB_t = b_1^2A_1 + b_2^2A_2 - b_t^2A_t = \frac{A_1A_2}{A_1 + A_2}\,(b_1 - b_2)^2$$

It follows that when $b_1 = b_2$ this sum of squares is zero, and that a test of
significance of the corresponding variance ($V_4$) is a test of the significance
of the difference between the error and treatment regression coefficients.
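
The equality above, and the arrangement of the second table, can be illustrated with a small Python sketch. All the sums of squares and products below are hypothetical numbers chosen only for the arithmetic, and the labels V1 to V4 follow the reading of the second table assumed here: V2 for error, V1 for treatments with q degrees of freedom, V3 for treatments with q - 1, and V4 for the difference between the regression coefficients.

```python
from fractions import Fraction

# Hypothetical sums of squares and products (not taken from any example).
A1, B1, C1 = Fraction(40), Fraction(30), Fraction(90)    # treatments
A2, B2, C2 = Fraction(120), Fraction(60), Fraction(150)  # error
q, n = 5, 20                      # DF for treatments and for error

# Treatments plus error, and the three regression coefficients.
At, Bt, Ct = A1 + A2, B1 + B2, C1 + C2
b1, b2, bt = B1 / A1, B2 / A2, Bt / At

ss_error = C2 - b2 * B2                    # n - 1 DF
ss_treat_q = C1 + b2 * B2 - bt * Bt        # q DF, obtained by difference
ss_treat_q1 = C1 - b1 * B1                 # q - 1 DF
ss_slopes = b1 * B1 + b2 * B2 - bt * Bt    # 1 DF, tests b1 against b2

V2 = ss_error / (n - 1)
V1 = ss_treat_q / q
V3 = ss_treat_q1 / (q - 1)
V4 = ss_slopes / 1

# Variance ratios against the adjusted error variance.
print(float(V1 / V2), float(V3 / V2), float(V4 / V2))
```

With these numbers the three forms of the equality all reduce to 15/8, and the two treatment sums of squares differ by exactly that amount.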

The test of significance of the treatment differences after adjustment
for the regression of y on x involves a comparison of the variances $V_2$ and
$V_1$. The fact that $V_1$ may contain a significant effect due to $(b_1 - b_2)$
does not vitiate the meaning of the test, as such an effect is obviously due
to some factor characteristic of the treatments. In the case of yield and
number of plants per plot, the variety regression coefficient $(b_1)$ might
be higher than $(b_2)$, and this will contribute to the significance of $V_1$; but
$b_2$ represents the regression of yield on number of plants within varieties,
and may be taken as a true measure of the effect of number of plants on
yield. If the treatment regression coefficient is higher this probably
reflects an additional genetic relationship, and one that should contribute
to the significance of the differences between the varieties. A further
test may be applied, however, to $V_3$, and by a comparison of the signif-
icance of $V_1$ and $V_3$ a complete picture of the variety effects is obtained.
The value of such an analysis, if, for example, number of roots has a
significant effect on yield, is that the error variance and variety variance
will be reduced proportionately with a consequent increase in the signif-
icance of the variety differences, if such differences exist. If the anal-
ysis of the unadjusted yields shows significant differences when the
adjusted yields do not, this simply means that the original differences
were due to number of roots and not to the yielding characteristics of the
varieties as measured by average yield per root.

R. A. Fisher (4) has pointed out that an appropriate scale for measuring
the effectiveness of methods of reducing the error is the inverse of
the variance. This is sometimes called the invariance and is represented
by 1/V. In measuring the reduction of error by means of the covariance
analysis, this scale is particularly useful. Example 48 is a good
illustration of this point. The original error variance is about three times as
large as the final error variance obtained by adjusting the sums of squares
for two associated variables. In other words, in the original form without
any adjustment about three times as many replications would be
required to give the same accuracy as the adjusted values. One should
not reason from this that the significance of the differences between the
treatments will be increased accordingly, as it must be remembered that
at the same time differences between the treatments due to the
associated variables are also being removed.

The test of significance having been applied as outlined, the next step
is to make an actual correction of the variety means. Since the
regression coefficient in the error line may be considered as representing the
actual effect of number of roots on yield, this regression coefficient
should be used for making corrections. The corrected means should
then be the best possible estimates of what the means would have been
if they had not been affected by variations in number of roots. The
regression equation will be of the form:

\[ Y_i = \bar{y}_i - b(\bar{x}_i - \bar{x}) \qquad (7) \]

where $\bar{x}_i$ is the mean of x for one variety, $\bar{y}_i$ is the mean of y for the same
variety, $\bar{x}$ is the general mean of x, b is the regression of y on x in the error line, and $Y_i$ is the
estimated mean of the variety.

To compare two corrected means such as $Y_p$ and $Y_q$ we must use for
the standard error of the difference between two means

\[ \sqrt{s^2\left[\frac{2}{r} + \frac{(\bar{x}_p - \bar{x}_q)^2}{A}\right]} \]

where $s^2$ is the variance in the error line of the analysis of covariance
table (for example, in Table 87 it will be 7681.3/35 = 219.5), A is the
sum of squares for x in the same line, r is the number of replications, and
$(\bar{x}_p - \bar{x}_q)$ is the difference between the two means used in the two
expressions for calculating $Y_p$ and $Y_q$. Thus

\[ Y_p = \bar{y}_p - b(\bar{x}_p - \bar{x}) \quad\text{and}\quad Y_q = \bar{y}_q - b(\bar{x}_q - \bar{x}) \]
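The correction of a mean and the standard error of a difference between two corrected means can be sketched as follows. This is a minimal illustration, assuming the standard error has the form $\sqrt{s^2[2/r + (\bar{x}_p - \bar{x}_q)^2/A]}$; the function names are hypothetical, and the lot means used in the example are invented for illustration only (the error-line values 2.7193, 219.5, and 368.4 are those of Table 87).

```python
import math

def adjusted_mean(y_mean, x_mean, x_general_mean, b_error):
    """Mean of y for one variety, corrected to the general mean of x."""
    return y_mean - b_error * (x_mean - x_general_mean)

def se_adjusted_difference(s2, r, A, x_mean_p, x_mean_q):
    """Standard error of the difference between two corrected means.

    s2: error variance after adjustment, r: replications,
    A: error sum of squares for x.
    """
    return math.sqrt(s2 * (2.0 / r + (x_mean_p - x_mean_q) ** 2 / A))

# Error line of Table 87; the means below are hypothetical.
b, s2, A, r = 2.7193, 219.5, 368.4, 10
y_p, x_p = 180.0, 33.0   # hypothetical lot means
y_q, x_q = 172.0, 30.0   # hypothetical lot means
x_general = 31.0         # hypothetical general mean of x

diff = adjusted_mean(y_p, x_p, x_general, b) - adjusted_mean(y_q, x_q, x_general, b)
se = se_adjusted_difference(s2, r, A, x_p, x_q)
print(round(diff, 2), round(se, 2))
```

When the two varieties have the same mean of x, the second term vanishes and the expression reduces to the ordinary $\sqrt{2s^2/r}$.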

In comparing two means corrected for two variables $x_1$ and $x_2$ we
calculate the standard error of a mean difference as follows:

\[ \sqrt{s^2\left[\frac{2}{r} + \frac{u^2B - 2uvP + v^2A}{AB - P^2}\right]} \]

where A and B are the sums of squares in the error line for $x_1$ and $x_2$,

P is the sum of products for $x_1$ and $x_2$ in the error line,

$u = (\bar{x}_{1p} - \bar{x}_{1q})$, difference between $x_1$ means,

$v = (\bar{x}_{2p} - \bar{x}_{2q})$, difference between $x_2$ means.

The method of error control by means of two or more associated 
variables is described in Example 48. 
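A short sketch of the two-variable standard error follows, assuming the form $\sqrt{s^2[2/r + (u^2B - 2uvP + v^2A)/(AB - P^2)]}$. The function name and the numbers in the example are hypothetical and serve only to show the calculation.

```python
import math

def se_adjusted_difference_two(s2, r, A, B, P, u, v):
    """Standard error of a difference between two means corrected for x1 and x2.

    A, B: error sums of squares for x1 and x2; P: error sum of products;
    u, v: differences between the two x1 means and the two x2 means.
    """
    quadratic = (u * u * B - 2.0 * u * v * P + v * v * A) / (A * B - P * P)
    return math.sqrt(s2 * (2.0 / r + quadratic))

# Hypothetical values for illustration only.
print(round(se_adjusted_difference_two(100.0, 10, 400.0, 900.0, 150.0, 2.0, -3.0), 3))
```

If the two associated variables are uncorrelated in the error line (P = 0) and the $x_2$ means are equal (v = 0), the expression reduces to the single-variable form with $(\bar{x}_p - \bar{x}_q)^2/A$.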

6. A Test of the Heterogeneity of a Series of Regression Coefficients.
The analysis of covariance provides a unique technique for testing the
significance of the differences between two or more regression coefficients.
Using the same symbolism as in the previous section, the procedure is as
given below.


Group   DF   Σ(x²)   Σ(xy)   Σ(y²)   b_yx             b_yx Σ(xy)   Σ(y'²)         DF
1       q    A_1     B_1     C_1     b_1 = B_1/A_1    b_1B_1       C_1 - b_1B_1   q - 1
2       q    A_2     B_2     C_2     b_2 = B_2/A_2    b_2B_2       C_2 - b_2B_2   q - 1
...
i       q    A_i     B_i     C_i     b_i = B_i/A_i    b_iB_i       C_i - b_iB_i   q - 1
...
p       q    A_p     B_p     C_p     b_p = B_p/A_p    b_pB_p       C_p - b_pB_p   q - 1
Total   pq   A_t     B_t     C_t     b_t = B_t/A_t    b_tB_t       C_t - b_tB_t   pq - 1


                 DF         Σ(y'²)            Variance
Total            pq - 1     C_t - b_tB_t
Within groups    p(q - 1)   Σ(C - bB)         V_1, error variance
Difference       p - 1      Σ(bB) - b_tB_t    V_2, due to differences between regression coefficients


The last sum of squares may be shown to be

\[ \sum(bB) - b_tB_t = \frac{\sum A_jA_k(b_j - b_k)^2}{A_1 + A_2 + \cdots + A_p} \]

where $b_j$ and $b_k$ represent all possible pairs of the regression coefficients and $A_j$
and $A_k$ all possible pairs of the corresponding sums of squares for x.

The comparison of $V_1$ and $V_2$ by means of the z test furnishes therefore
the required test of the heterogeneity of the regression coefficients.

Example 48. For the sake of brevity this one example will be used to demonstrate
most of the important applications of the covariance technique. Data are given by
Crampton and Hopkins (1) on weights, gains, and feed consumption in a comparative
feeding trial. These data are reproduced in Table 86 for initial weight, feed eaten,
and final weight. The analysis is concerned with expressing the results for final
weight corrected for variations in initial weight, corrected for variations in feed
eaten, and corrected for initial weight and feed eaten. The last is an application of
the method of partial regression which is described in detail in the paper by Crampton
and Hopkins. In addition a test will be illustrated of the significance of the
differences between the regression coefficients for each treatment.

(1) Effect of Initial Weight on Final Weight. The analysis of covariance is set
up in the form shown in Table 87. In performing the calculations for such a table,
it is recommended that the sums of squares, sums of products, and totals be obtained
by treatments, as it is necessary to keep these separate if certain tests are to be
employed at a later stage. In obtaining the sums of products it should be noted
that a procedure may be followed exactly analogous to that for obtaining sums of
squares. With k replications of n treatments, the sums of products are given as
follows:

\[
\begin{aligned}
\text{Total:}\quad & \sum (x - \bar{x})(y - \bar{y}) = \sum (xy) - T_xT_y/N \\
\text{Between means of treatments:}\quad & k\sum (\bar{x}_t - \bar{x})(\bar{y}_t - \bar{y}) = \sum (T_{tx}T_{ty})/k - T_xT_y/N \\
\text{Between means of replicates:}\quad & n\sum (\bar{x}_r - \bar{x})(\bar{y}_r - \bar{y}) = \sum (T_{rx}T_{ry})/n - T_xT_y/N \\
\text{Residual or error:}\quad & \text{Total} - (\text{treatments}) - (\text{replicates})
\end{aligned}
\]

where $T_{tx}$ and $T_{ty}$ are treatment subtotals for x and y,
$T_{rx}$ and $T_{ry}$ are replicate subtotals for x and y,
and $T_x$, $T_y$, and N are the grand totals of x and y and the total number of observations.
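The partition of the sum of products can be sketched in a few lines. This is a minimal illustration with a hypothetical function name and a small invented data set; passing the same variable for both arguments gives the corresponding sums of squares.

```python
def sums_of_products(x, y):
    """Partition the sum of products of x and y.

    x[t][r] and y[t][r] hold the values for treatment t in replicate r.
    """
    n = len(x)        # number of treatments
    k = len(x[0])     # number of replications
    N = n * k
    Tx = sum(map(sum, x))
    Ty = sum(map(sum, y))
    correction = Tx * Ty / N

    total = sum(x[t][r] * y[t][r] for t in range(n) for r in range(k)) - correction
    treatments = sum(sum(x[t]) * sum(y[t]) for t in range(n)) / k - correction
    replicates = sum(
        sum(x[t][r] for t in range(n)) * sum(y[t][r] for t in range(n))
        for r in range(k)
    ) / n - correction
    error = total - treatments - replicates
    return {"total": total, "treatments": treatments,
            "replicates": replicates, "error": error}

# Invented data: 2 treatments, 3 replicates.
x = [[1, 2, 3], [2, 4, 7]]
y = [[2, 1, 4], [3, 5, 9]]
print(sums_of_products(x, y))
```

Keeping the treatment subtotals separate, as the text recommends, means the same subtotals can be reused later for the test of heterogeneity of regression.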
[Table 86: initial weight, feed eaten, and final weight for 10 replicates in each of Lots I to V. The table was printed sideways and most of its entries are illegible in this copy. The columns that can be read are listed below in the order printed.]

Lot II, feed eaten:        699  626  668  668  707  651  672  660  769  666
Lot II, initial weight:     26   24   20   35   25   26   20   31   29   27
Lot III, initial weight:    39   34   32   35   32   36   30   29   32   25
Lot V, initial weight:      41   36   32   35   32   36   30   28   32   26
TABLE 87

Analysis of Covariance: Final Weight and Initial Weight
x3 = initial weight    x1 = final weight

                     (1)   (2)      (3)       (4)        (5)      (6)          (7)        (8)   (9)
                     DF    Σ(x3²)   Σ(x1x3)   Σ(x1²)     b13      b13 Σ(x1x3)  Σ(y'²)     DF    r13
Replicates            9    454.4      752.0    2,487.2   1.6549   1,244.5       1,242.7    8
Treatments            4             1,172.2    5,741.7   2.3016   2,697.9       3,043.8    3
Error                36    368.4    1,001.8   10,405.6   2.7193   2,724.2       7,681.3   35
Treatments + Error   40    877.6    2,173.8   16,147.2   2.4770   5,384.5      10,762.7   39

(1) DF for unadjusted sums of squares.

(5) b13 = item in col. (3) divided by item in col. (2).

(6) b13 Σ(x1x3) = col. (5) × col. (3), or col. (3)²/col. (2).

(7) Adjusted sums of squares = col. (4) - col. (6).

(8) DF for adjusted sums of squares.

(9) Correlation coefficient (unnecessary for tests of significance).

From Table 87 we can proceed to the test of significance of the treatment
differences adjusted for initial weight and of the difference between the treatment and
error regression coefficients.

                             DF   Sum of Squares   Variance   F      5% Point
Treatments + Error           39   10,762.7
Error                        35    7,681.3           219.5
Difference = Treatments       4    3,081.4           770.4    3.51   2.64
Treatments                    3    3,043.8         1,014.7
Difference = (b_e - b_t)      1       37.6            37.6

Since the difference between the error and treatment regression coefficients
$(b_e - b_t)$ is obviously insignificant the tests of significance are not carried any
further.
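The arithmetic behind this test can be checked with a short sketch. It assumes the error and treatments + error lines of Table 87, with the error sum of squares for final weight taken as 10,405.6; the function name is hypothetical.

```python
def adjusted_ss(A, B, C):
    """Sum of squares of y adjusted for its regression on x: C - B^2 / A."""
    return C - B * B / A

# (A, B, C) = (sum of x3^2, sum of x1*x3, sum of x1^2) from Table 87.
error = adjusted_ss(368.4, 1001.8, 10405.6)              # 35 DF after adjustment
treat_plus_error = adjusted_ss(877.6, 2173.8, 16147.2)   # 39 DF after adjustment

# Treatments adjusted by the error regression are obtained by difference (4 DF).
treatments_by_difference = treat_plus_error - error
error_variance = error / 35
F = (treatments_by_difference / 4) / error_variance

print(round(error, 1), round(treat_plus_error, 1))
print(round(error_variance, 1), round(F, 2))
```

The ratio agrees with the tabulated F of 3.51, which exceeds the 5% point of 2.64.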

To adjust the means of the treatment final weights for the initial weights we use
the equation given above, which in terms of the symbols now being used will be

\[ X_{1t} = \bar{x}_{1t} - b_{13}(\bar{x}_{3t} - \bar{x}_3) \]

(2) Effect of Feed Eaten on Final Weight. The procedure is exactly the same as
above so will be given in tabular form only.
TABLE 88

Analysis of Covariance: Feed Eaten and Final Weight
x2 = feed eaten    x1 = final weight

                     DF   Σ(x2²)      Σ(x1x2)    Σ(x1²)     b12   b12 Σ(x1x2)   Σ(y'²)    DF   r12
Replicates            9    35,150.3    8,774.1    2,487.2          2,190.19               8   0.9384
Treatments            4    28,404.9   11,596.5    5,741.7          4,734.39     1,007.3    3
Error                36    90,792.3   24,508.7   10,405.6          6,615.88     3,789.6   35   0.7974
Treatments + Error   40   119,197.2   36,105.2   16,147.2         10,936.26     5,210.9   39




                             DF   Σ(y'²)   Variance   F      5% Point
Treatments + Error           39   5210.9
Error                        35   3789.6   108.3
Difference = Treatments       4   1421.3   355.3      3.28   2.64
Treatments                    3   1007.3   335.8      3.10   2.87
Difference = (b_e - b_t)      1    414.0   414.0      3.82   4.12


There is an indication here of a difference between the regression coefficients for 
treatments and error but it is hardly significant. 

(3) Effect of Initial Weight and Feed Eaten on Final Weight. After obtaining the
separate sums of squares for each variable and the sums of products for the three ways
in which the variables can be paired the next step is to determine the partial regression
coefficients. For three variables the sums of squares and products give two
simultaneous equations as illustrated in Chapter VIII. These equations contain the
partial regression coefficients as unknowns and can be most easily solved by the
normal equation method, also described in Chapter VIII. The remainder of the
calculations are as in Table 89.

TABLE 89

Analysis of Covariance: Effect of Initial Weight and Feed Eaten on Final Weight

                     Σ(x1²)     DF   b12.3 Σ(x1x2)   b13.2 Σ(x1x3)   Σ(y'²)   DF
Replicates            2,487.2    9
Treatments            5,741.7    4   4002.8                           778.9     2
Error                10,405.6   36   5910.9          988.7           3505.9    34
Treatments + Error   16,147.2   40   9411.6                          4471.7    38

TABLE 89 (Continued)

                             Σ(y'²)   DF   Variance   F      5% Point
Treatments + Error           4471.7   38
Error                        3505.9   34   103.1
Difference = Treatments       965.8    4              2.34   2.64
Treatments                    778.9    2              3.78   3.26
Difference                    186.9    2    93.4


The final result is rather unusual in that the treatment variance corrected by
its own regression coefficient is significant while the treatment variance as obtained
by differences is insignificant. This seems to be traceable to the relations between
$x_1$ and $x_3$ where, as will be noted in Table 87, the difference between the regression
coefficients is much less than would be expected on the basis of random sampling.

The equation for correcting the mean final weights will now be

\[ X_{1t} = \bar{x}_{1t} - b_{12\cdot 3}(\bar{x}_{2t} - \bar{x}_2) - b_{13\cdot 2}(\bar{x}_{3t} - \bar{x}_3) \]

where $b_{12\cdot 3}$ and $b_{13\cdot 2}$ are the partial regression coefficients for the error covariance.

(4) Test of Heterogeneity of Covariation or the Significance of the Differences
between Regression Coefficients Calculated for Each Group. If for the above example
we have kept our raw sums of squares and products separate for each treatment
we can very quickly set up the results as in Table 90, showing the sums of squares
and products for $x_1$ and $x_3$, the regression coefficients for each group, and finally the
adjusted sums of squares for $x_1$.


TABLE 90

Test of Heterogeneity of Regression between Treatments

           DF   Σ(x3²)   Σ(x1x3)   Σ(x1²)     b13      b13 Σ(x1x3)   Σ(y'²)   DF
Lot I       9   168.9      458.5    2,020.5   2.7146   1244.6         775.9    8
Lot II      9   192.1      102.6      715.6              54.8         660.8    8
Lot III     9   132.1      169.4    2,969.6   1.2824    217.2        2752.4    8
Lot IV      9   158.1      333.7    1,964.9             704.3        1260.6    8
Lot V       9   191.6      689.6    5,222.1   3.5992   2482.0        2740.1    8
Total      45   842.8    1,753.8   12,892.7   2.0809   3649.5        9243.2   44

TABLE 90 (Continued)

                      DF   Σ(y'²)   Variance   F      5% Point
Total                 44   9243.2
Within treatments     40   8189.8   204.7
Difference             4   1053.4   263.4      1.29   2.61


For the test of significance we summate the adjusted sums of squares for each
treatment and subtracting from the total obtain a sum of squares corresponding to
4 degrees of freedom representing differences between the 5 regression coefficients.
In this example there is no evidence of significant heterogeneity of regression.
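The heterogeneity test can be sketched from the separate sums of squares and products of the five lots. This is an illustrative check, not the author's worksheet: the Σ(x1²) entries for Lots III and V are taken as 2,969.6 and 5,222.1, the values consistent with the adjusted sums of squares and with the column total of Table 90, and the variable names are hypothetical. The sketch also confirms the pairwise identity for the sum of squares due to differences between regression coefficients.

```python
from itertools import combinations

# (sum of x3^2, sum of x1*x3, sum of x1^2) for each lot, 9 DF each.
lots = {
    "I":   (168.9, 458.5, 2020.5),
    "II":  (192.1, 102.6, 715.6),
    "III": (132.1, 169.4, 2969.6),
    "IV":  (158.1, 333.7, 1964.9),
    "V":   (191.6, 689.6, 5222.1),
}
p, q = len(lots), 9

A_t = sum(A for A, _, _ in lots.values())
B_t = sum(B for _, B, _ in lots.values())
C_t = sum(C for _, _, C in lots.values())

within = sum(C - B * B / A for A, B, C in lots.values())   # p(q - 1) DF
total = C_t - B_t * B_t / A_t                              # pq - 1 DF
difference = total - within                                # p - 1 DF

V1 = within / (p * (q - 1))      # error variance
V2 = difference / (p - 1)        # due to differences between regressions
F = V2 / V1

# The same sum of squares from all pairs of regression coefficients.
pairwise = sum(
    Aj * Ak * (Bj / Aj - Bk / Ak) ** 2
    for (Aj, Bj, _), (Ak, Bk, _) in combinations(lots.values(), 2)
) / A_t

print(round(within, 1), round(total, 1), round(difference, 1), round(F, 2))
```

The variance ratio of about 1.29 on 4 and 40 degrees of freedom is well below the 5% point of 2.61.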

6. Exercises. 

1. The data given in Table 91 are grain and straw yields given by Eden and Fisher
(2) for 8 manurial treatments and 8 replicates of each. Calculate the correlation and
regression coefficients for treatments, replicates, and residual. Test the significance
of the grain yield differences for the treatments after correction for straw yield. Test
the significance of the difference between the regression coefficients for treatments
and residual, and apply the test for heterogeneity to the regression coefficients
calculated for each treatment.
CHAPTER XVI


MISCELLANEOUS APPLICATIONS

I. THE ESTIMATION OF MISSING VALUES

1. Reasons for Estimating Missing Values and Principles of Estimation.
In most experimental work, and especially in field plot studies,
the results of one or more observations are occasionally lost or distorted
by some disturbing factor in such a way as to make the particular
observations useless. In the laboratory it may be possible to repeat a
portion of the experiment and obtain new values for those that are missing,
but in field experiments repetition is impossible and one has to make
the best of the results available. In other biological experiments
it is frequently impossible to repeat under the identical conditions
of the original experiment, and methods of estimating missing or
distorted values are preferable to discarding the whole or a portion of
the data.

A method of estimating the yields of missing plots in field experiments
on a strictly statistical basis was first developed by Allan and Wishart
(1). Their methods were developed for the estimation of one missing
yield; but more recently Yates (3) has extended their methods to the
estimation of the yields of several missing plots. Since the methods
developed by Yates are of general application, we shall use them throughout,
although for single missing plots they are identical with those of
Allan and Wishart. The mathematical basis of the method of estimating
missing values is the substitution of a value for the one missing that
will make the sum of the squares of the deviations from the mean a minimum.
Equations are written for the sum of squares substituting x
for the missing value; and after minimizing, the equations are solved
for x.

2. Estimation of Missing Yields in Randomized Block Experiments.
The data are first arranged in a table according to treatments and
blocks. Table 92 is an example of an experiment with 6 treatments
in 4 randomized blocks, and 1 plot of treatment B of block II is missing.
TABLE 92

                        Treatments
Blocks    A      B          C      D      E      F      Total
I         18.5   15.7       16.2   14.1   13.0   13.6    91.1
II        11.7              12.9   14.4   16.9   12.5    68.4 = Q
III       15.4   16.6       15.5   20.3   18.4   21.6   107.8
IV        16.5   18.6       12.7   15.7   16.5   18.0    98.0
Total     62.1   50.9 = P   57.3   64.5   64.8   65.7   365.3 = T


In the generalized formula for x, the yield of the missing plot:

p = number of treatments,

q = number of blocks,

P = total of all the plots receiving the same treatment as the
missing plot,

Q = total of all the plots in the same block as the missing plot,

T = total of all plots.

The formula is:

\[ x = \frac{pP + qQ - T}{(p - 1)(q - 1)} \qquad (1) \]
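Formula (1) applied to the missing plot of Table 92 can be sketched as follows; the function name is hypothetical, and the totals P, Q, and T are those shown in the table.

```python
def missing_plot(p, q, P, Q, T):
    """Estimate of a single missing plot in a randomized block experiment.

    p: treatments, q: blocks, P: total of the remaining plots of the same
    treatment, Q: total of the remaining plots of the same block,
    T: total of all remaining plots.
    """
    return (p * P + q * Q - T) / ((p - 1) * (q - 1))

# Table 92: treatment B is missing in block II.
x = missing_plot(p=6, q=4, P=50.9, Q=68.4, T=365.3)
print(round(x, 1))
```

With these totals the numerator is 305.4 + 273.6 - 365.3 = 213.7 and the divisor is 15, so the estimate is about 14.2.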


In Table 93 we have the same data as in Table 92 except that now
three plots are missing.


TABLE 93

                        Treatments
Blocks    A      B      C      D      E      F      Total
I         18.5   15.7   16.2   14.1   13.0   13.6    91.1
II        11.7   (B)    12.9   (D)    16.9   12.5    54.0
III       15.4   16.6   15.5   20.3   18.4   21.6   107.8
IV        (A)    18.6   12.7   15.7   16.5   18.0    81.5
Total     45.6   50.9   57.3   50.1   64.8   65.7   334.4


The procedure in such an example where more than one observation 
is missing is first to substitute approximate values for all the missing 
values except the one to be estimated. We then apply the missing- 
plot formula as given above. The same process is in turn applied to all 
the missing plots. The results given are first approximations, and the 
whole process is repeated until the estimated values become practically
constant.

The methods are illustrated below for the estimation of the missing
values in Table 93.

First Approximation

Average yield = 334.4/21 = 15.9,

T = (334.4 + 2 × 15.9) = 366.2.

Here the average yield of the plots is used as an approximation of the
yields of two of the three missing plots.

A. P = 45.6   Q = 81.5

x = (6 × 45.6 + 4 × 81.5 - 366.2)/15 = 15.6

B. P = 50.9   Q = (54.0 + 15.9) = 69.9

x = (6 × 50.9 + 4 × 69.9 - 366.2)/15 = 14.6

Note that here we have to substitute a value for D and that the mean of
all the plots is taken as the best approximation.

D. P = 50.1   Q = (54.0 + 14.6) = 68.6

x = (6 × 50.1 + 4 × 68.6 - 366.2)/15 = 13.9

Here we have to substitute a value for B, and the previously estimated
value is taken as the best approximation.

Second Approximation

A. T = (334.4 + 14.6 + 13.9) = 362.9; P = 45.6; Q = 81.5;

x = (6 × 45.6 + 4 × 81.5 - 362.9)/15 = 15.8.

In all the approximations after the first a new value for T is worked out
for the estimate of each plot, using the estimates from the previous
approximation. To get P and Q it is best to substitute for the missing
plot values, where necessary, the latest values obtained.
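The successive approximations can be carried to convergence with a short routine. The sketch below is one possible arrangement, not the text's exact worksheet: it starts every missing plot at the mean of the observed plots and always uses the latest estimates for P, Q, and T, so its intermediate figures differ slightly from the rounded hand calculation. The function names are hypothetical.

```python
def missing_plot(p, q, P, Q, T):
    """Missing-plot formula for randomized blocks."""
    return (p * P + q * Q - T) / ((p - 1) * (q - 1))

def estimate_missing(table, tol=1e-6, max_iter=100):
    """table[block][treatment]; None marks a missing plot."""
    q, p = len(table), len(table[0])
    present = [v for row in table for v in row if v is not None]
    start = sum(present) / len(present)
    missing = [(i, j) for i in range(q) for j in range(p) if table[i][j] is None]
    filled = [[start if v is None else v for v in row] for row in table]
    for _ in range(max_iter):
        largest_change = 0.0
        for i, j in missing:
            old = filled[i][j]
            filled[i][j] = 0.0                     # leave this plot out of the totals
            P = sum(filled[b][j] for b in range(q))
            Q = sum(filled[i])
            T = sum(map(sum, filled))
            filled[i][j] = missing_plot(p, q, P, Q, T)
            largest_change = max(largest_change, abs(filled[i][j] - old))
        if largest_change < tol:
            break
    return {cell: filled[cell[0]][cell[1]] for cell in missing}

# Table 93: rows are blocks I to IV, columns are treatments A to F.
table93 = [
    [18.5, 15.7, 16.2, 14.1, 13.0, 13.6],
    [11.7, None, 12.9, None, 16.9, 12.5],
    [15.4, 16.6, 15.5, 20.3, 18.4, 21.6],
    [None, 18.6, 12.7, 15.7, 16.5, 18.0],
]
estimates = estimate_missing(table93)
print({cell: round(value, 2) for cell, value in estimates.items()})
```

The converged value for treatment A in block IV is close to the 15.8 obtained in the second approximation above.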

3. Estimation of Missing Yields in a Latin Square. The best
arrangement of the data is in a table such that the positions of the figures
correspond with the positions of the plots in the field. The treatments
should also be indicated on the table in the exact positions that they
occur.

The formula for estimating x, the yield of a missing plot, is:

\[ x = \frac{p(P_r + P_c + P_t) - 2T}{(p - 1)(p - 2)} \qquad (2) \]

where $P_r$ = total of row containing the missing plot,

$P_c$ = total of column containing the missing plot,

$P_t$ = total of treatment containing the missing plot,

T = total of all plots,

p = number of rows, columns, and treatments.

If more than one plot is missing, we proceed exactly as for randomized
blocks, substituting approximate values for the plots not being estimated
and making continuous applications of formula (2).
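Formula (2) can be sketched in the same way. The check below uses an invented 3 × 3 Latin square built from purely additive row, column, and treatment effects, for which the formula should return the omitted value exactly; the function name and the data are hypothetical.

```python
def missing_plot_latin(p, P_row, P_col, P_treat, T):
    """Estimate of a single missing plot in a p x p Latin square."""
    return (p * (P_row + P_col + P_treat) - 2 * T) / ((p - 1) * (p - 2))

# Invented additive square: yield = row effect + column effect + treatment effect.
p = 3
rows, cols, treats = [0, 1, 2], [0, 10, 20], [0, 100, 200]
treatment_of = lambda i, j: (i + j) % p            # cyclic Latin square
square = [[rows[i] + cols[j] + treats[treatment_of(i, j)] for j in range(p)]
          for i in range(p)]

mi, mj = 0, 0                                      # plot treated as missing
true_value = square[mi][mj]
P_row = sum(square[mi]) - true_value
P_col = sum(square[i][mj] for i in range(p)) - true_value
P_treat = sum(square[i][j] for i in range(p) for j in range(p)
              if treatment_of(i, j) == treatment_of(mi, mj)) - true_value
T = sum(map(sum, square)) - true_value

print(missing_plot_latin(p, P_row, P_col, P_treat, T), true_value)
```

In real data the estimate will not reproduce a lost observation exactly, since that observation carried its own experimental error.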

4. Correction to Analysis of Variance Due to Estimation of Missing
Values. The estimation of missing values for a set of results introduces
a complication in the analysis of variance. In the first place, one DF
must be removed from the total for each missing value; and in the second
place a correction must be applied to the sums of squares for treatments
or any other component in the analysis, the significance of which
is to be tested against the error. An exact mathematical solution of this
problem for all cases has been provided by Yates (3), but except for
randomized block experiments, and for Latin square experiments with
only one missing plot, it is rather complicated for general practice.

In a randomized block experiment as in Table 93, for which three of
the missing plot yields were estimated, the following scheme for the
analysis of variance shows how the correction is applied to the treatment
variance. In this scheme the "original" values refers to those for the
21 plots as given in Table 93, and the "completed" values refers to those
in Table 93 with the addition of the three that were estimated.



                                     DF   Sum of Squares Calculated from
Total                                20   Original yields
Error                                12   Completed yields
Difference = Blocks + Treatments      8
Blocks                                3   Original yields
Difference = Treatments               5

The procedure for calculation is as follows: 

(a) Obtain the sums of squares for blocks, treatments, and error 
from the completed yields. 

(b) Obtain total sum of squares for original yields. 

(c) Obtain sum of squares for blocks from original yields, noting 
that not all the blocks contain 6 plots. 
(d) Set up the analysis of variance as above, obtaining the sums of
squares first for blocks + treatments and then for treatments by
subtraction from the known quantities.

For Latin square experiments with only one plot missing the simplest
method of determining the correction to the treatment sum of squares is
to use the formula


which gives the correction directly. The scheme of analysis using a 
6X6 Latin square would then be as follows: 


                                           DF   Sum of Squares Calculated from
Total                                      34   Original values
Error                                      19   Completed values
Difference = Rows + Columns + Treatments   15
Treatments - Correction                     5   Calculate from completed values and subtract correction


5. Correction of Treatment Means and Standard Errors. The
treatment means that contain estimated values for missing plots are in
effect corrected means and further corrections are not required. The
standard errors of such means, however, require a definite correction,
and for methods of doing this accurately the reader should refer to the
paper by Yates (3). For general purposes it is probably sufficient to
make a correction for the number of plots averaged, i.e., if there are r
replications and one plot is missing the standard error of the mean of the
treatment containing the missing plot will be

\[ \sqrt{\frac{s^2}{r - 1}} \]

where $s^2$ is the error variance.

II. METHODS OF RANDOMIZATION

Randomization can be effected by tossing coins, drawing cards out of
a shuffled deck, throwing dice, etc., but these methods are too slow and
in general too inaccurate for actual practice. The problem has been
greatly simplified by the preparation of Tippett's "Random Sampling
Numbers" (2), and these numbers are now in general use.*

If we have a series of numbers 1, 2, 3, ..., n, the problem of randomization
is to arrange these numbers in such a way that in forming the
arrangement any one of the numbers has an equal chance with any other
number of being placed in a given position. A procedure that is frequently
followed in arranging field plot tests may now be described
briefly.

Suppose that the numbers representing the varieties are 1, 2, 3, 4, 5,
6, 7, 8, 9. Turning to page XI of Tippett's "Tables" (the usual practice
being to open the book more or less at random), we find that beginning
at the upper left-hand corner we can take a series of random two-figure
numbers as follows: 40, 81, 89, 58, 87, 74, etc. Assume now that
there are 9 places to be filled up by the numbers 1 to 9, and the first one
is selected by dividing the first two-figure number by 9 and taking the
remainder. Thus for 40/9, the remainder is 4, and number 9 is placed
in the fourth place. The second number to be placed is 8 and we divide
the second two-figure number by 8; 81/8 gives a remainder of 1, and 8 is
placed in the first place. The third number is 7, and dividing it into 89
the remainder is 5, and 7 is placed in the fifth space counting only those
that are empty. This procedure is followed until all the numbers have
been placed and we get finally the following arrangement:

8, 3, 5, 9, 4, 6, 7, 2, 1

The same procedure can be modified for application to a Latin square,
but in that case it is only necessary, starting with a given Latin square
which may be made up systematically, to randomize the rows, columns,
and treatments.
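The placing procedure can be sketched as follows. Two points are assumptions made for the illustration and are not stated in the text: a remainder of zero is read as the last empty place, and the final three two-figure numbers (10, 11, 5) are invented, since only the first six are quoted above. The function name is hypothetical.

```python
def arrange(random_numbers, n):
    """Place the numbers n, n-1, ..., 1 using remainders of random numbers."""
    places = [None] * n
    for number, rnd in zip(range(n, 0, -1), random_numbers):
        empty = [i for i, v in enumerate(places) if v is None]   # 'number' places remain
        remainder = rnd % number
        # Assumption: a remainder of 0 is read as the last empty place.
        position = remainder if remainder != 0 else number
        places[empty[position - 1]] = number
    return places

# First six numbers are those quoted from the random-number tables;
# the last three are invented for illustration.
print(arrange([40, 81, 89, 58, 87, 74, 10, 11, 5], 9))
```

With these numbers the routine reproduces the arrangement 8, 3, 5, 9, 4, 6, 7, 2, 1 given in the text.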



TABLES 


TABLE 94

Table of t*

Degrees of                  Probability
Freedom      0.50    0.10    0.05    0.02    0.01
   1         1.000   6.31   12.71   31.82   63.66
   2         0.816   2.92    4.30    6.96    9.92
   3         0.765   2.35    3.18    4.54    5.84
   4         0.741   2.13    2.78    3.75    4.60
   5         0.727   2.02    2.57    3.36    4.03
   6         0.718   1.94    2.45    3.14    3.71
   7         0.711   1.90    2.36    3.00    3.50
   8         0.706   1.86    2.31    2.90    3.36
   9         0.703   1.83    2.26    2.82    3.25
  10         0.700   1.81    2.23    2.76    3.17
  11         0.697   1.80    2.20    2.72    3.11
  12         0.695   1.78    2.18    2.68    3.06
  13         0.694   1.77    2.16    2.65    3.01
  14         0.692   1.76    2.14    2.62    2.98
  15         0.691   1.75    2.13    2.60    2.95
  16         0.690   1.75    2.12    2.58    2.92
  17         0.689   1.74    2.11    2.57    2.90
  18         0.688   1.73    2.10    2.55    2.88
  19         0.688   1.73    2.09    2.54    2.86
  20         0.687   1.72    2.09    2.53    2.84
  21         0.686   1.72    2.08    2.52    2.83
  22         0.686   1.72    2.07    2.51    2.82
  23         0.685   1.71    2.07    2.50    2.81
  24         0.685   1.71    2.06    2.49    2.80
  25         0.684   1.71    2.06    2.48    2.79
  26         0.684   1.71    2.06    2.48    2.78
  27         0.684   1.70    2.05    2.47    2.77
  28         0.683   1.70    2.05    2.47    2.76
  29         0.683   1.70    2.04    2.46    2.76
  30         0.683   1.70    2.04    2.46    2.75
  35         0.682   1.69    2.03    2.44    2.72
  40         0.681   1.68    2.02    2.42    2.71
  45         0.680   1.68    2.02    2.41    2.69
  50         0.679   1.68    2.01    2.40    2.68
  60         0.678   1.67    2.00    2.39    2.66
  70         0.678   1.67    2.00    2.38    2.65
  80         0.677   1.66    1.99    2.38    2.64
  90         0.677   1.66    1.99    2.37    2.63
 100         0.677   1.66    1.98    2.36    2.63
 125         0.676   1.66    1.98    2.36    2.62
 150         0.676   1.66    1.98    2.35    2.61
 200         0.675   1.65    1.97    2.35    2.60
 300         0.675   1.65    1.97    2.34    2.59
 400         0.675   1.65    1.97    2.34    2.59
 500         0.674   1.65    1.96    2.33    2.59
1000         0.674   1.65    1.96    2.33    2.58
   ∞         0.674   1.64    1.96    2.33    2.58

* The greater portion of this table is taken from R. A. Fisher's "Statistical Methods for Research
Workers," with the permission of the author and his publishers, Oliver and Boyd, London.
TABLE 95

Table of χ²*

Degrees of                              Probability
Freedom     0.99       0.95      0.50    0.30    0.20    0.10    0.05    0.01
   1        0.000157   0.00393    0.46    1.07    1.64    2.71    3.84    6.64
   2        0.0201     0.103      1.39    2.41    3.22    4.60    5.99    9.21
   3        0.115      0.35       2.37    3.66    4.64    6.25    7.82   11.34
   4        0.30       0.71       3.36    4.88    5.99    7.78    9.49   13.28
   5        0.55       1.14       4.35    6.06    7.29    9.24   11.07   15.09
   6        0.87       1.64       5.35    7.23    8.56   10.64   12.59   16.81
   7        1.24       2.17       6.35    8.38    9.80   12.02   14.07   18.48
   8        1.65       2.73       7.34    9.52   11.03   13.36   15.51   20.09
   9        2.09       3.32       8.34   10.66   12.24   14.68   16.92   21.67
  10        2.56       3.94       9.34   11.78   13.44   15.99   18.31   23.21
  11        3.05       4.58      10.34   12.90   14.63   17.28   19.68   24.72
  12        3.57       5.23      11.34   14.01   15.81   18.55   21.03   26.22
  13        4.11       5.89      12.34   15.12   16.98   19.81   22.36   27.69
  14        4.66       6.57      13.34   16.22   18.15   21.06   23.68   29.14
  15        5.23       7.26      14.34   17.32   19.31   22.31   25.00   30.58
  16        5.81       7.96      15.34   18.42   20.46   23.54   26.30   32.00
  17        6.41       8.67      16.34   19.51   21.62   24.77   27.59   33.41
  18        7.02       9.39      17.34   20.60   22.76   25.99   28.87   34.80
  19        7.63      10.12      18.34   21.69   23.90   27.20   30.14   36.19
  20        8.26      10.85      19.34   22.78   25.04   28.41   31.41   37.57
  21        8.90      11.59      20.34   23.86   26.17   29.62   32.67   38.93
  22        9.54      12.34      21.34   24.94   27.30   30.81   33.92   40.29
  23       10.20      13.09      22.34   26.02   28.43   32.01   35.17   41.64
  24       10.86      13.85      23.34   27.10   29.55   33.20   36.42   42.98
  25       11.52      14.61      24.34   28.17   30.68   34.38   37.65   44.31
  26       12.20      15.38      25.34   29.25   31.80   35.56   38.88   45.64
  27       12.88      16.15      26.34   30.32   32.91   36.74   40.11   46.96
  28       13.56      16.93      27.34   31.39   34.03   37.92   41.34   48.28
  29       14.26      17.71      28.34   32.46   35.14   39.09   42.56   49.59
  30       14.95      18.49      29.34   33.53   36.25   40.26   43.77   50.89

* Taken from R. A. Fisher's "Statistical Methods for Research Workers," with the permission
of the author and the publishers, Oliver and Boyd, London.